The time it takes to reveal embarrassing information in a mobile phone survey
To explore socially desirable responding in telephone surveys, this study examines response latencies in answers to 27 questions in a corpus of 319 audio-recorded voice interviews on iPhones. Response latencies were compared when respondents (a) answered questions on sensitive vs. nonsensitive topics (as classified by online raters); (b) produced more vs. less socially desirable answers; and (c) were interviewed by a professional interviewer or an automated system. Respondents answered questions on sensitive topics more quickly than questions on nonsensitive topics, though patterns varied by question format (categorical, numerical, ordinal). Independent of question sensitivity, respondents gave less socially desirable answers more quickly when answering categorical and ordinal questions but more slowly when answering numeric questions. The speedup for sensitive questions was more pronounced when questions were asked by human interviewers than by the automated system. Findings demonstrate that response times can be (differently) revealing about question and response sensitivity in a telephone survey.
Keywords: Sensitive questions; sensitive responses; social desirability; telephone survey; self-administered; response time; IVR; interviews
Introduction
People answering survey questions are more likely to misreport – to provide answers that are not strictly accurate – when the questions are about sensitive topics, and when the answers could be considered embarrassing. Respondents can underreport behaviors they see as socially undesirable, for example, abortion, declaring bankruptcy and drunk driving (e.g., Fu et al., [17]; Locander et al., [27]), and they can overreport socially desirable behaviors like voting, attending religious services, and having a library card (e.g., Belli et al., [5]; Brenner & DeLamater, [7]; Hadaway et al., [20]; Locander et al., [27]; Presser & Stinson, [31]). The evidence suggests that greater privacy – less social influence from an interviewer, as in self-administered and automated survey modes – can lead to greater reporting of socially undesirable information (e.g., Corkrey & Parkinson, [12]; Tourangeau & Smith, [40]; Turner et al., [41]). The evidence also suggests that these more private modes lead to more accurate answers when compared with validating information (e.g., Kreuter et al., [25]).
The current study investigates how people answer questions on sensitive topics ('sensitive questions') in human-administered and automated telephone interviews, as part of an effort to better understand the dynamics of socially desirable responding. In this case, the interviews were on mobile devices; although we cannot address the extent to which our findings depend on the mobile context, mobile responding is an important part of the landscape of modern surveys, with particular features: the potential for unstable connectivity and changing environments with different levels of ambient noise, the possibility for new kinds of multitasking while answering, and the fact that mobile respondents may well be among other people – friends or strangers – while answering. How exactly the possibility for distracted responding and feeling less private while talking might affect responding to sensitive questions, and whether the patterns are the same in interviewer-administered and automated (self-administered) interviews, is currently unknown.
The particular focus in this study is on mobile respondents' response latency – delays until answering – as they answer sensitive and nonsensitive questions on their iPhones. Response latency (pausing before answering) can be seen as a window into the response process, as one kind of nonverbal 'paradata' produced along with survey responses; other examples include disfluencies like um and uh, and hedges like 'maybe' or 'sort of' (e.g., Schober & Bloom, [34]). These kinds of paralinguistic paradata have been demonstrated to correlate with response difficulty in methodological studies of surveys about nonsensitive facts and behaviors. For example, in one study landline telephone respondents to an automated speech system delayed longer when the circumstances they were answering about did not correspond straightforwardly with question concepts (Ehlen et al., [15]). In another, face-to-face and landline telephone respondents were more likely to produce disfluent answers, and (in the face-to-face interviews) to avert their gaze, for answers that turned out to be unreliable (Schober et al., [36]).
Response latency has also been theorized to connect systematically with socially desirable responding, with a range of possible automatic and deliberate effects depending on which underlying mechanisms are involved (Andersen & Mayerl, [1], [2]; Holtgraves, [23]). A response-editing possibility predicts that respondents will take longer to answer questions when under greater pressure to give a socially desirable response because evaluating the extent to which an answer fits social norms takes time. Biased- or bypassed-retrieval possibilities predict that respondents will only recall socially desirable information that is consistent with their self-concept (perhaps even as an act of self-deception), in which case they may answer more quickly in situations that promote social desirability. To test these alternatives, Holtgraves ([23]) asked participants to rate, on a computer, the extent to which a set of traits (e.g., cold, reserved, thrifty) and behaviors (e.g., cheating on a test, jaywalking, donating to charity) applied to them; some participants were put under greater social desirability pressure by being told their responses would be used to create an 'adjustment profile,' and others were put under less social desirability pressure by being told their responses were entirely anonymous. The finding was that participants under greater social desirability pressure took longer to respond, consistent with the response-editing possibility. Response latency in such situations may depend on whether questions are about socially desirable or undesirable characteristics; pre-service teachers in Germany answering web-based questions about their suitability for their chosen profession on a tablet gave socially desirable responses more quickly for desirable traits, but more slowly for undesirable traits (Andersen & Mayerl, [1], [2]).
A more survey-centered theory of self-report for intrusive or sensitive behavioral questions (Schaeffer, [33]) focuses on recall processes: how important a behavior is to a respondent, how frequently they engage in it, social reinforcements for it, and how distinct it is from other behaviors. In general, behaviors that are salient to a respondent and simple in structure should be easier and therefore faster to report. On the other hand, if a respondent feels that a question is intrusive and the answer isn't any of the interviewer's business, then denial of a socially undesirable behavior may be an automatic and immediate defense strategy. In this case, respondents should report socially desirable behaviors more quickly than socially undesirable ones. Along these same lines, Schaeffer ([33]) proposed that because reports of sensitive information in self-administered or computerized interviews are more private, there should be less response bias or editing, in which case respondents should report sensitive information more quickly to an automated system than to a human interviewer.
Of course, there are many variables that can affect response time to survey questions other than social desirability. Respondents take longer to respond to questions that are long (Dunn et al., [14]; Hanley, [21]; Holden et al., [22]), poorly worded (Bassili & Scott, [4]), with complex sentence structure (Dunn et al., [14]), or with a high number of response options (Olson & Smyth, [29]; Yan & Tourangeau, [42]). Respondents tend to take longer to answer questions at the beginning of an interview than towards the end (Stout, [38]; Yan & Tourangeau, [42]), and respondents can take longer answering questions for which they are uncertain about the answers (Draisma & Dijkstra, [13]). More generally, respondents take longer to respond to questions that pose more complex recall and formulation tasks. Distinguishing the specific effects of question or response sensitivity on response latency from these other potential causes requires setting up 'fair' comparisons that rule out alternative explanations – comparing response latency for sensitive and nonsensitive questions that are comparable on other dimensions likely to affect response time.
To this end, the current study makes use of a corpus of 319 telephone interviews on iPhones that, because of the nature of the study design, allows some of these additional systematic comparisons. In particular, the corpus allows between-subject comparisons, in a sample larger than the usual laboratory sample, of response latencies to (a) sensitive and nonsensitive questions with comparable recall periods (e.g., in the past month, the past year, over a lifetime) and comparable response formats (numerical, ordinal, categorical); (b) response options with a range of social desirability for each question examined (including nonsensitive questions); and (c) answers to identically worded sensitive survey questions asked by an automated system or human interviewer. The corpus also allows within-subject comparison of response latencies by the same person to sensitive and nonsensitive questions. The fact that survey respondents in these interviews were randomly assigned to interviewing modes (interviewer-administered vs. automated) after volunteering to participate in a study on their iPhone suggests that motivation to participate in the interviews was unlikely to have differed across the groups. And the fact that the groups did not differ in any demographic characteristics makes it unlikely that social norms or cognitive abilities would particularly have differed across the modes, and so comparing the timing profile of responses to human and automated interviews is likely to be meaningful.
The nature of the corpus thus allows us to address the following research questions in a focused way. In a smartphone survey where respondents might be around other people, mobile, multitasking, or distracted:
RQ1. Do respondents take different amounts of time to respond to sensitive than nonsensitive questions, independent of whether their responses are embarrassing or not? And does this differ depending on the question's response format (numerical, ordinal, categorical)?
RQ2. Do respondents take different amounts of time to give socially desirable than undesirable responses, independent of whether the question is sensitive? And does this differ depending on the response format?
RQ3. Do respondents take different amounts of time to answer sensitive questions or give embarrassing answers to a human vs. an automated interviewer?
Method
The corpus of data consisted of audio recordings of the 319 telephone interviews in Schober et al.'s ([35]) study, which compared data quality in text message and voice interviews administered by human interviewers and automated systems (see Conrad & Schober, [11], for the full data set of responses in that study). A total of 310 recordings had sufficient audio quality for analysis.
Respondents
US-based respondents were recruited between March and May 2012 from four online sources – Google AdWords, Amazon Mechanical Turk, Facebook Ads, and Craigslist – to participate in a survey on their iPhone. From a link in each recruitment source, potential respondents were taken to a browser-based screener questionnaire in which they explicitly consented to participate in a research study by clicking a check box, after answering questions about their zip code, date of birth, education, income, gender, ethnicity, race, telephone number service provider, voice minutes and text message plan, whether they were the sole user of this phone, and time zone. A total of 634 respondents met the criteria to participate (21 years of age and up, US area code and iPhone user – verified by text message) and completed the study, including a post-interview exit questionnaire. The 634 respondents were randomly assigned to one of four modes of interviewing: human-interviewer voice, automated-system voice, human-interviewer text (human interviewers used customized desktop software to send standardized text message questions and probes, and individualized text messages, to respondents on their iPhones, and interviewers manually recorded the answers), and automated-system text (an automated system texted standardized questions and probes to respondents on their iPhones and automatically recorded answers). The current study used the data from two of the modes of interviewing: Human Voice (n = 160) and Automated Voice (n = 159). The 310 respondents in this subsample with audio recordings suitable for analysis ranged from 21 to 81 years of age, with a broad range of education levels, income, gender, and self-reported race groupings (see Online Supplement, Table 1, for details); these characteristics did not differ significantly between the Human Voice and Automated Voice groups. See Antoun et al. ([3]) for discussion of how the different recruitment sources affected sample composition.
Measures
Interview
This study carries out analyses on a subset of 27 of the 32 questions from the Schober et al. ([35]) data set. The 32 questions, selected from US social surveys and methodological studies, included 15 questions that allowed measurement of conscientious responding (e.g., number of movies seen last month, number of visits to the grocery store), 15 questions on sensitive behaviors (e.g., frequency of sex, illegal drug use, newspaper reading), and 2 questions that were repeats of previous questions. This study excluded the 2 repeated questions and 3 additional questions that focused on iPhone metrics rather than the respondent's past behaviors.
The set of 27 questions varied in response format, with categorical (yes/no, n = 2, and multiple response options, n = 2), numerical (n = 11), and ordinal (rating scales like 'every day', 'a few times per week', 'once a week', etc., n = 12) formats. The temporal frame (reference period) for the questions varied from the present (e.g., 'How often do you now smoke cigarettes: "every day", "some days" or "not at all"?', n = 13) to the past 30 days (e.g., 'During the past 30 days, on how many days did you drink one or more drinks of an alcoholic beverage?', n = 6), to the past 12 months (e.g., 'How many sex partners have you had in the last 12 months?', n = 4), to time since the respondent's 18th birthday or over the entire life (e.g., 'Have you smoked at least 100 cigarettes in your entire life?', n = 4). See Online Supplement, Table 2 for the full list of items.
In the Automated Voice condition, respondents were asked each question by the speech-IVR (Interactive Voice Response) system, which was programmed to recognize which of the response options, if any, matched respondents' spoken answers. The IVR system implemented a standardized interview with audio recordings of the questions and prompts by one female interviewer, and a speech recognition system (AT&T's Watson) that listened for potential responses fitting a pre-specified grammar; the interview was clearly an automated interview from the start, with a recorded voice asking if the respondent was in a place where it was safe to talk, identifying the study, and presenting instructions for continuing. These included instructions that respondents could say 'skip' to decline to answer a particular question and 'help' to request clarification. The system was designed such that if a respondent did not answer, or gave a response that the system couldn't recognize, the system would re-ask the question, up to a total of 3 times per question. If a respondent gave an answer that the system recognized with limited confidence, the system would ask the respondent to confirm their choice (e.g., 'I think you said "every day." Is that right? Yes or no.'). See Johnston et al. ([24]) for more details about the implementation of the automated system.
In the Human Voice condition, 8 interviewers at the University of Michigan Survey Research Center carried out standardized interviews asking the same 32 questions. The question wording and the sanctioned neutral probes were exactly the same as in the Automated Voice condition, but interviewers could use their discretion to manage the interaction, for example, choosing whether and when to provide confirmations that an answer had been recognized ('okay,' 'thanks,' 'got it'), as opposed to the Automated Voice procedure of always providing such confirmation when a system-determined high-confidence response was recorded. Human interviewers could deviate from the standardized script to respond to casual commentary from the respondent if needed. The interviewers' accuracy in presenting the questions exactly as worded was extremely high (97.6% for a subset of 100 question-answer sequences coded by multiple raters, with high interrater reliability [Cohen's kappas >.90]).[1] Human interviewers were trained to provide the same scripted clarification as the automated system if a respondent asked for it, but they could also use their own words to manage the conversation as needed.
Debriefing questionnaire
After participants completed the interview, they were sent a text message with a link to a web-based post-interview debriefing questionnaire. This questionnaire asked participants follow-up questions about the following topics: presence of others during the interview, environmental conditions (e.g., location of the interview), multitasking, intrusiveness of questions, mode preference (for upcoming interviews), ease of use, and satisfaction with the interview.
Measuring response latency
The Human Voice audio files consisted of 160 recordings, each spanning the entire length of an interview from start to finish and capturing all interviewer and respondent utterances; 152 recordings had sufficient quality for audio analysis. For the Automated Voice condition, there were 8228 audio files captured by the speech recognition software, each containing one caller utterance per question. The audio software Praat (Boersma & Weenink, [6]) was used to calculate response latencies by measuring the time elapsed from the end (offset) of each question to the beginning (onset) of the respondent's answer. The onset of respondents' answers was measured in two ways: (1) the time until the first spoken sound (including any disfluencies, restarts, laughter, etc.) and (2) the time until respondents gave their actual answer.[2]
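For concreteness, the two latency measures can be sketched as follows. This is a minimal Python sketch with hypothetical field names, not part of the original Praat-based pipeline; in the study the corresponding time points were annotated by hand in the audio.

```python
from dataclasses import dataclass

@dataclass
class Timing:
    """Annotated time points (in seconds) for one question-answer exchange.
    Field names are illustrative; the study marked these points in Praat."""
    question_offset: float    # end (offset) of the interviewer's question
    first_sound_onset: float  # first respondent sound (um, laughter, restart, ...)
    answer_onset: float       # start of the actual answer

def time_until_first_sound(t: Timing) -> float:
    return t.first_sound_onset - t.question_offset

def time_until_first_response(t: Timing) -> float:
    return t.answer_onset - t.question_offset

# A respondent says "um" 0.75 s after the question ends and answers 2.0 s after it:
t = Timing(question_offset=12.5, first_sound_onset=13.25, answer_onset=14.5)
print(time_until_first_sound(t), time_until_first_response(t))  # 0.75 2.0
```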
Determining sensitivity and social desirability
To establish which questions within this set were indeed sensitive, and which responses were socially desirable and undesirable, a set of ratings was collected from an online sample of participants (n = 102) recruited by Qualtrics (see Feuer et al., [16], for full details).[3] The 102 raters (age 21 and up) were recruited from across the United States in October 2017 to match the respondent demographics of the Schober et al. ([35]) study as closely as possible on gender, age, and highest level of education completed. The online survey, which included all 27 questions and responses from the Schober et al. ([35]) study, asked raters to rate how embarrassed they thought most people would be to (a) answer each question and (b) give each response option for each question. (The raters did not provide their actual answer to the question, only their rating of how embarrassed most people would be to answer the question or give the response option.) For each question and response option, raters were asked to judge whether most people would be 'not at all embarrassed', 'somewhat embarrassed', or 'very embarrassed.'[4] For the questions with numerical answers, the response options for this questionnaire were based on the distribution of numerical responses in Schober et al. ([35]); the value 0 was always included as one option, and non-zero responses were grouped into reasonable increments based on the range in the question (e.g., 5-day increments for reporting days drinking in the past 30 days).
Using these scores, a sensitivity rating was calculated for each question as the percentage of panelists who thought most people would be 'somewhat embarrassed' or 'very embarrassed' simply to be asked the question. For the 27 questions, the ratings ranged from 9.8% (frequency of eating spicy food) to 79.4% (frequency of having sex). Figure 1 shows the sensitivity rating for each question, ordered from most to least sensitive. For the purposes of this study, 9 questions were classified as sensitive, based on the threshold that 50% of the Qualtrics-recruited panelists judged that most people would find the question embarrassing to answer, and 18 questions were classified as nonsensitive.[5]
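As an illustration, the classification rule amounts to the following minimal sketch (names are hypothetical, and whether the 50% boundary is inclusive is our assumption; the actual ratings appear in the Online Supplement):

```python
EMBARRASSED = {"somewhat embarrassed", "very embarrassed"}

def sensitivity_rating(ratings: list[str]) -> float:
    """Percent of raters judging that most people would be at least
    somewhat embarrassed to be asked the question."""
    return 100 * sum(r in EMBARRASSED for r in ratings) / len(ratings)

def classify_question(ratings: list[str], threshold: float = 50.0) -> str:
    # 50% threshold as in the paper; footnote 5 discusses alternative thresholds.
    return "sensitive" if sensitivity_rating(ratings) >= threshold else "nonsensitive"

# e.g., 81 of 102 raters judging a question embarrassing gives 79.4%,
# matching the most sensitive question in the set (frequency of having sex):
ratings = ["very embarrassed"] * 81 + ["not at all embarrassed"] * 21
print(round(sensitivity_rating(ratings), 1), classify_question(ratings))  # 79.4 sensitive
```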
Figure 1. Percentage of online panelists who rated each question as embarrassing to answer, ordered from most to least sensitive.
Panelists' embarrassment ratings were also used to classify each response option as socially desirable (nonsensitive) or socially undesirable (sensitive). Based on the percentage of panelists who said most people would be 'somewhat embarrassed' or 'very embarrassed' to give each response, the ratings ranged from 7.8% socially undesirable (to report grocery shopping 1–15 times a month) to 88.2% socially undesirable (for a female to report having had 21–100 female sex partners since her 18th birthday). Using a 50% threshold (50% of panelists judging that a particular response was sensitive), questions varied substantially in how many response options were socially undesirable. For some questions (e.g., eating spicy food, or avoiding fast food), no response options were considered sensitive. For others (e.g., sex partners in the last 12 months), most or all response options were sensitive. Sensitive questions could have nonsensitive responses, and nonsensitive questions could have embarrassing response options (e.g., eating in a restaurant or going to the movies too often). See Online Supplement Table 2 for the embarrassment ratings for all response options.
Results
To address the three research questions, we fitted a series of multilevel regression models to simultaneously evaluate the fixed effects of question or response sensitivity, response type (ordinal, numeric, categorical), and interview mode (interviewer-administered vs. automated) on response times, along with potential interactions between question or response sensitivity and response type as well as interview mode. To account for within-respondent correlations, given the repeated measures design of the study, we included random effects of specific respondents in the models. To account for potential effects of additional question properties beyond response type, we included additional fixed effects of question length (number of words in each question), question position in the questionnaire (first third, middle third, last third), and recall period (none/the present, past day, past 30 days, past 12 months, or since 18th birthday).
We fitted separate models to natural log transformations of both response time measures – time until first sound and time until first response – to account for the pronounced right-skewness of response times in this dataset; models were estimated using restricted maximum likelihood estimation in Stata/SE version 15.1 (using the 'mixed' command). The natural log transformations resulted in the best-fitting linear models relative to other log transformations and a square root transformation, but the linear models still produced residuals that were not normally distributed. Because of these violations of the normal-residuals assumption, additional multilevel ordinal regression models were fitted as a robustness check (using Stata's 'meologit' command) on quintile-transformed ordinal versions of the two response time dependent variables, simultaneously evaluating exactly the same effects and interactions as in the linear models. Because the pattern of findings for these ordinal models was essentially the same for all predictors and interactions, we are confident in the robustness of the linear model findings and report those here. (See Online Supplement Tables 3 and 4 for findings from the ordinal models.)
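For readers who want to reproduce the general setup, the following is a rough Python analogue of this specification (the analyses themselves were run in Stata; all column names here are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format file: one row per question-answer sequence.
df = pd.read_csv("latencies.csv")
df["ln_rt"] = np.log(df["time_until_first_response"])  # natural log of seconds

# Fixed effects of mode, response type, and question sensitivity, the two
# interactions of interest, and the question-property controls; a random
# intercept per respondent handles the repeated-measures structure.
model = smf.mixedlm(
    "ln_rt ~ C(mode)*C(q_sensitive) + C(resp_type)*C(q_sensitive)"
    " + q_length + C(q_position) + C(recall_period)",
    data=df,
    groups=df["respondent"],
)
print(model.fit(reml=True).summary())  # REML, as in the paper's Stata models

# Outcome for the ordinal robustness check (the paper used Stata's 'meologit';
# statsmodels has no mixed ordinal logit, so this line only prepares the variable).
df["rt_quintile"] = pd.qcut(df["time_until_first_response"], 5, labels=False)
```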
Tables 1 and 2 report results and model summaries for the dependent variable 'time until first response.' (The patterns of results for the dependent variable 'time until first sound' are almost identical; the few cases of different patterns are reported below.) The variables added into the models as plausible contributors to response time beyond those investigated in the research questions all proved to have significant effects on response time. Respondents took longer to give answers to questions with numeric and ordinal response formats than to questions with categorical responses. They answered longer questions (those with more words in them) more quickly, which is plausible given that longer question wordings may give respondents more time to think about their answers during the delivery of the questions. Questions that came later in the survey took longer to respond to, which presumably reflects the difficulty of the response tasks posed by those particular questions (as opposed to the more frequently observed pattern that respondents tend to answer more quickly as surveys proceed, e.g., Yan & Tourangeau, [42]). Questions with any recall period took longer to answer than questions with no recall period. The random-effect parameter for respondent was significant, which is unsurprising given the extent to which individuals can stably vary in their response speeds. In any case, the point is that the significant findings on the research questions examined below demonstrate effects after controlling for these potentially confounding factors.
Table 1. Multilevel regression modeling effects of question sensitivity on natural log transformation (ln) of 'time until first response.' Significance tests are two-tailed. *p < 0.05, **p < 0.01, ***p < 0.001
| Predictor | Category | Coefficient | (SE) |
| Interview mode | Automated | .01 | (.03) |
| Response type | Numeric | .55*** | (.06) |
| | Ordinal | .37*** | (.03) |
| Question sensitivity | Sensitive | −.28*** | (.04) |
| Interview mode x question sensitivity | Automated x Sensitive | .17*** | (.02) |
| Question sensitivity x response type | Sensitive x Numeric | −.23*** | (.05) |
| | Sensitive x Ordinal | −.06 | (.05) |
| Question length | | −.00* | (.00) |
| Question placement | | .04** | (.01) |
| Recall period | Past day | .38*** | (.05) |
| | Past month | .36*** | (.04) |
| | Past year | .19*** | (.03) |
| | Since 18th birthday | .23*** | (.05) |
| Constant | | .21*** | (.05) |
| Random-effect parameter | | Estimate | (SE) |
| Respondent | | .04*** | (.00) |
Table 2. Multilevel regression modeling effects of response sensitivity on natural log transformation (ln) of 'time until first response.' Significance tests are two-tailed. *p < 0.05, **p < 0.01, ***p < 0.001
| Predictor | Category | Coefficient | (SE) |
| Interview mode | Automated | .08** | (.03) |
| Response type | Numeric | .34*** | (.04) |
| | Ordinal | .39*** | (.02) |
| Response sensitivity | Sensitive | .06 | (.04) |
| Interview mode x response sensitivity | Automated x Sensitive | −.04 | (.03) |
| Response sensitivity x response type | Sensitive x Numeric | .42*** | (.04) |
| | Sensitive x Ordinal | −.09* | (.05) |
| Question length | | −.00* | (.01) |
| Question placement | | .16*** | (.01) |
| Recall period | Past day | .49*** | (.05) |
| | Past month | .48*** | (.04) |
| | Past year | .16*** | (.03) |
| | Since 18th birthday | −.12** | (.04) |
| Constant | | −.12*** | (.03) |
| Random-effect parameter | | Estimate | (SE) |
| Respondent | | .00*** | (.00) |
Research Question 1: Do respondents take different amounts of time to respond to sensitive than nonsensitive questions?
As Table 1 demonstrates, question sensitivity significantly affected response times (see Online Supplement Table 3 for complementary results of the robustness check using quintile-transformed ordinal models). Respondents answered the sensitive questions significantly more quickly than the nonsensitive questions. So the answer to Research Question 1 is clear: Respondents take longer to respond to nonsensitive questions.
The interaction effects (Table 1) show that the impact of question sensitivity is greater for questions with numeric response formats than for the reference category (questions with categorical response formats) or for questions with ordinal responses. (The same interaction effect is observed for the dependent variable 'time until first sound' in the linear model, but not in the robustness-check ordinal model.) The pattern of results in Table 3, which presents mean and median times until response for the sensitive and nonsensitive questions (untransformed, and so not reflecting the right-skewness that the log transformation accounts for in the multilevel models), suggests that in this data set the speedup for sensitive questions was largest for the numeric response-format questions.
Table 3. Time to respond (seconds) in answers to nonsensitive vs. sensitive questions
| | Nonsensitive | | | Sensitive | | |
| Response type | Mean | (SD) | Median | Mean | (SD) | Median |
| Categorical | 1.40 | (0.60) | 1.31 | 1.23 | (0.70) | 1.08 |
| Numeric | 3.73 | (2.11) | 3.30 | 2.31 | (2.07) | 1.81 |
| Ordinal | 2.29 | (1.93) | 2.34 | 1.78 | (1.51) | 1.43 |
Research Question 2: Do respondents take different amounts of time to give socially undesirable than desirable responses?
Using the same modeling approach, this time with the rated sensitivity of the response (rather than of the question) as the predictor, Table 2 shows a different pattern than Table 1. While the sensitivity of responses did not affect response times significantly overall, the interaction effects (response sensitivity x response type) show significant and opposite effects of response sensitivity on response time for different response types. In particular, respondents took longer to give sensitive responses for the numeric questions (relative to the reference category of questions with categorical response options), and they gave sensitive ordinal responses significantly more quickly. (See Online Supplement Table 4 for complementary results of the robustness check using quintile-transformed ordinal models). So the answer to Research Question 2 is that participants did take longer to give socially undesirable numerical responses, but that they gave socially undesirable ordinal responses more quickly.[6]
The (untransformed) mean and median response times in Table 4 give a sense of the size and directionality of these differences, although of course again without reflecting the right-skewness that is accounted for through the log-transformed data analyzed in the multilevel models.
Table 4. Time to respond (seconds) in nonsensitive vs. sensitive responses
| | Nonsensitive | | | Sensitive | | |
| Response type | Mean | (SD) | Median | Mean | (SD) | Median |
| Categorical | 1.27 | (0.63) | 1.16 | 1.28 | (0.78) | 1.09 |
| Numeric | 3.13 | (2.03) | 2.70 | 3.56 | (2.07) | 2.65 |
| Ordinal | 2.25 | (1.59) | 1.84 | 1.78 | (1.40) | 1.57 |
Additional exploratory analyses
While the evidence suggests that respondents give embarrassing answers either more quickly (for questions with ordinal responses) or more slowly (for questions with numerical responses), it is important to rule out simpler alternative explanations that could account for effects of response sensitivity before concluding this definitively. We therefore performed two additional exploratory analyses.
Speed of answering 'no' or 'zero'? As seen in Online supplement Table 2, a number of the socially desirable responses turn out to be those in which the respondent has nothing to report, e.g., drinking 0 days, or reporting 'not at all' for sex frequency. This raises the question of whether anything about the observed patterns in giving more and less sensitive responses is simply the result of cases where respondents had less to report – because it may be easy and quick to answer 'no' or 'zero,' but it may take time to recall and count (or estimate) non-zero behaviors. This might be particularly likely in responses to questions with a numeric response format in this data set (the questions are largely behavioral frequency questions), and also plausible for ordinal responses to questions about behavioral frequency (though not for the opinion questions).
To examine this, we re-fit the same multilevel regression models examining the effects of response sensitivity but this time removing (filtering out) all such responses from the data: '0', 'not at all,' 'no partners,' and 'never.' The models again simultaneously evaluate the effects of response sensitivity, response type (ordinal, numeric, categorical), and interview mode (interviewer-administered vs. automated) on response times, along with potential interactions between response sensitivity and response type as well as interview mode. We again included random effects of specific respondents and questions in the models, as well as additional fixed effects of question length, question position in the questionnaire, and recall period.
The pattern of effects of the response sensitivity interactions – direction of coefficients and significance – is exactly the same in these models (see Online Supplement Table 5 for details). (The effect of interview mode is no longer significant with the time until first response measure, but remains significant with the time until first sound measure.) Sensitive responses to questions with numeric response formats were again significantly slower than nonsensitive responses, and sensitive responses to ordinal questions were significantly quicker than nonsensitive responses. This analysis suggests that the pattern of findings cannot be explained by easily provided, and thus speedy, 'no' and '0' responses.
Time to recall and report more behaviors? Given that for many of the behavioral frequency questions in this dataset the socially undesirable response options require respondents to think of more instances, it is also worth asking whether longer delays in giving embarrassing responses to numerical questions simply reflect the time it takes to report more behaviors, rather than the sensitive nature of the information being reported. When respondents answer frequency questions by recalling and counting episodes (sometimes called 'enumeration'), their response times increase as a linear function of their answer (e.g., Conrad et al., [10]), indicating that each retrieval operation contributes to the total response time. When participants answer by using strategies that are not recall-based, e.g., rate of occurrence or qualitative impressions of frequency like 'pretty often,' their response times are unrelated to the magnitude of their answer, presumably because the amount of recall does not increase as the frequency of the event increases (see also Brown, [8]). There is also evidence that males are more inclined than females to report the number of lifetime sex partners using recall-based strategies (Brown & Sinclair, [9]), which could further complicate the current results.
This possibility – that longer response times for higher reported frequencies are driven by the amount of recall rather than by an editing step before responding – is lent plausibility by the fact that, as Table 5 shows, respondents took longer to give their first response to the six nonsensitive questions requiring numerical answers when their responses were larger (that is, there were moderate but significant positive correlations between response times and number of reported instances).
Table 5. Correlations between number of instances reported and time to first response for the nonsensitive questions requiring a numerical response
| Question | Correlation (Pearson) |
| Q20: Spicy food frequency | r(238) = .146, p = .024 |
| Q23: TV watching (over 5 hours is sensitive) | r(269) = .217, p < .001 |
| Q24: Movie watching | r(282) = .260, p < .001 |
| Q25: Movies in theater | r(282) = .388, p < .001 |
| Q26: Grocery shopping | r(266) = .281, p < .001 |
| Q27: Restaurant dining | r(281) = .135, p = .024 |
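These per-question correlations are straightforward to compute; a minimal sketch, reusing the hypothetical data frame and column names from the modeling sketch above, is:

```python
from scipy.stats import pearsonr

# Question IDs from Table 5; column names remain hypothetical.
NONSENSITIVE_NUMERIC = ["Q20", "Q23", "Q24", "Q25", "Q26", "Q27"]

subset = df[df["question_id"].isin(NONSENSITIVE_NUMERIC)].dropna(
    subset=["answer_value", "time_until_first_response"]
)
for qid, grp in subset.groupby("question_id"):
    r, p = pearsonr(grp["answer_value"], grp["time_until_first_response"])
    # Degrees of freedom for Pearson's r are n - 2, as reported in Table 5.
    print(f"{qid}: r({len(grp) - 2}) = {r:.3f}, p = {p:.3f}")
```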
To explore this possibility, we fit new models examining the effects of response sensitivity focusing only on the numeric responses. The models simultaneously evaluated the effects of response sensitivity and interview mode (interviewer-administered vs. automated) on response times, along with potential interactions between response sensitivity and interview mode. Along with including random effects of specific respondents and fixed effects of question properties (question length, question position in the questionnaire, and recall period), this time we included the numerosity of the responses as crossed random effects (essentially taking the raw numerical response as a covariate in the model, with the simplifying assumption that larger numbers would take more effort to recall even if the plausible range of responses for different questions may differ).
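A simplified sketch of this model, treating the reported count as a fixed covariate rather than as crossed random effects (a deliberate flattening of the specification described above; names again hypothetical, continuing the earlier sketch):

```python
import statsmodels.formula.api as smf

numeric_only = df[df["resp_type"] == "numeric"]  # numeric-format items only

model_num = smf.mixedlm(
    "ln_rt ~ C(mode)*C(resp_sensitive) + q_length + C(q_position)"
    " + C(recall_period) + answer_value",  # answer_value = the reported count
    data=numeric_only,
    groups=numeric_only["respondent"],
)
print(model_num.fit(reml=True).summary())
```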
The model results (see Online Supplement Table 6) clearly demonstrate that respondents took longer to produce sensitive numeric responses even accounting for the numerosity of the answer in this way. So giving a sensitive answer takes longer for numeric questions above and beyond any additional time that recalling more instances may take.
Research Question 3: Does the pattern of responding to sensitive questions or providing sensitive responses differ with a human vs. an automated interviewer?
The evidence in this data set on how interview mode affects response time for sensitive questions or responses shows effects only of question sensitivity, not of response sensitivity. In the models that included question sensitivity (Table 1), with the measure of time until first response there was an interaction of interview mode x question sensitivity: respondents in interviewer-administered interviews showed a bigger difference in their time to answer sensitive vs. nonsensitive questions than respondents in automated interviews. (There was no overall effect of interview mode on response time.) With the measure of time until first sound, in contrast, there was no interaction of interview mode x question sensitivity, although there was a significant effect of interview mode: respondents took longer to produce their first sound with a human interviewer than with the automated system. Table 6 shows the mean and median (untransformed) response times for the two measures, which again give a sense of the direction and size of effects but do not account for the right-skewness in the response times that the multilevel models do.
Table 6. Time to respond (seconds) in answers to nonsensitive vs. sensitive questions, by interview mode and latency measure
| Interview mode | Measure | Nonsensitive | | | Sensitive | | |
| | | Mean | (SD) | Median | Mean | (SD) | Median |
| Automated | Time until first sound | 2.59 | (1.37) | 2.30 | 1.85 | (1.06) | 1.66 |
| | Time until first response | 2.64 | (1.41) | 2.35 | 1.86 | (1.07) | 1.67 |
| Interviewer-administered | Time until first sound | 1.87 | (1.00) | 1.66 | 1.26 | (0.66) | 1.10 |
| | Time until first response | 2.99 | (2.39) | 2.36 | 1.77 | (2.12) | 1.20 |
As for the models that examine effects of response sensitivity, Table 2 shows that there was no interaction between interview mode and response sensitivity. In these models, respondents took longer to provide answers in human-administered than in automated interviews, but the effects were the same for sensitive and nonsensitive responses.
These findings are partially consistent with Schaeffer's ([33]) proposal that respondents should report sensitive information more quickly to automated systems than to human interviewers because automated interviews are more private and thus should involve less response bias or editing. The evidence shows that response times to human interviewers were generally slower, and that the discrepancy in response times for sensitive vs. nonsensitive questions was greater with human interviewers than in automated interviews. But there was no evidence that sensitive responses were reported any differently to the human vs. the automated interviewer. These data suggest that being asked a sensitive question by a human or an automated system does indeed lead to different response processes, but that the difference depends on the sensitivity of the question rather than the sensitivity of the response.
Could these differences in response time patterns in human vs. automated interviews be the result of differential nonresponse in the two modes rather than from different response processes? That is, might the automated interviews have attracted or retained quicker respondents than the human interviews, or respondents who were less concerned about presenting themselves in a positive light? While we cannot fully rule out this possibility, we see it as unlikely to be the explanation. Given that respondents in the two modes did not differ on any of the characteristics we measured (age, gender, ethnicity, race, education, income), differences in response speed or style that could be associated with these variables cannot account for the pattern. It is true that a greater percentage (13.37%) of those respondents who started the automated interview did not complete it (broke off) than of those who started the human interview (2.96%), consistent with what is typically found in comparisons of human- and self-administered surveys (see Tourangeau et al., [39], chapter 3). But following the logic for what it would take for differential completion to account for greater disclosure in automated than human-administered interviews in this dataset (Schober et al., [35]), this would require those who broke off in the automated system to primarily be people who would disclose substantially less than those who completed – which is not consistent with the more plausible pattern that those who break off are likely to be those with more to disclose. Nonetheless, to the extent that non-completers in the automated interviews may have been particularly squeamish about being asked sensitive questions, their not being included in the final dataset could contribute to the observed pattern that respondents in the automated interviews showed less of a discrepancy in their time to answer sensitive vs. nonsensitive questions than respondents in human-administered interviews.
Discussion
At the broadest level, the findings demonstrate that patterns of response latency are revealing about question and response sensitivity in telephone interviews (in this data set, mobile telephone interviews), and that question sensitivity and response sensitivity can have different effects. To summarize, the findings demonstrate (Research Question 1) that mobile respondents do indeed take different times to answer when asked sensitive vs. nonsensitive questions, independent of whether their responses are embarrassing, by multiple different measures of latency (raw and standardized times to first speech and to first response): they answer questions on sensitive topics more quickly. The pattern differs for different question types, with more extreme differences for questions requiring numerical responses.
For Research Question 2, for some question formats (ordinal and categorical) respondents gave sensitive responses more quickly, but for questions requiring numerical answers respondents were consistently slower to give embarrassing than non-embarrassing answers – even for questions that weren't rated as sensitive. While our evidence demonstrates that recalling and reporting more instances of a behavior takes longer, this is unlikely to account for all the findings here; socially undesirable responses still took longer even when we omitted easy-to-answer 'no' and 'zero' responses, and the findings show independent contributions of response size and response sensitivity.
The evidence on Research Question 3 showed that by some measures respondents were slower to answer sensitive questions with human interviewers than to an automated system, and their difference in speed in answering sensitive vs. nonsensitive questions was greater with human interviewers than the automated system. But – perhaps surprisingly – this difference only reflects sensitivity of the question topic rather than the sensitivity of the response; we saw no evidence for a different pattern in hesitations in providing sensitive answers to a human interviewer (who presumably might elicit more embarrassment or concern about judgment) than to an automated system. In any case, the findings demonstrate that respondents can treat automated and human interviewers differently even when the modes of administration – a spoken telephone interview with conversational back-and-forth – share many features.
The findings contribute to a growing body of research that examines paralinguistic correlates of survey responding based on close analysis of audio recordings of interviews (e.g., Ehlen et al., [15]; Garbarski et al., [19]; Schaeffer & Maynard, [32]; Schober & Bloom, [34]; Schober et al., [36]), this time with a particular focus on what might distinguish sensitive questions and responses from nonsensitive questions and responses. The evidence is only partially consistent with evidence in other domains that participants under social desirability pressure can take longer to respond in judging whether textually presented items apply to themselves (Holtgraves, [23]) and that they can take longer to answer textual questions about socially undesirable traits (Andersen & Mayerl, [1]). Here telephone survey respondents took longer to give socially undesirable answers only when providing numerical answers to behavioral frequency questions; their socially undesirable answers to questions with ordinal and categorical response formats were actually quicker. What is clear across studies is that multiple factors can contribute to hesitation in spoken responding, including problems mapping the available responses onto the circumstances about which one is answering (Ehlen et al., [15]; Schober & Bloom, [34]) and uncertainty about an answer (Draisma & Dijkstra, [13]; Garbarski et al., [18]; Smith & Clark, [37]). The evidence added from the current findings is that spoken latencies are also correlated with the processes involved in providing embarrassing information.
The findings also contribute to what is known about responding to sensitive questions. Consistent with Schaeffer's ([33]) arguments, question sensitivity is complex, and just because a question is sensitive doesn't mean that all responses to it are sensitive. Nor will every respondent likely find the same questions or responses sensitive (Lind et al., [26]). Our pattern of findings corroborates the complexity of question vs. response sensitivity, this time in a mobile environment – the latency patterns for sensitive questions and sensitive responses are different – and adds evidence that the patterns can be different for questions with different response formats (see also Olson et al., [30]).
Given that we don't know the true values of the responses in this study, the findings only allow speculation about the underlying cognitive and affective processes that result in these patterns. In particular, we can't know for sure whether respondents gave less embarrassing answers more quickly (for numerical questions) because reporting embarrassing information takes extra steps – steeling oneself for potential disapproval, editing and adjusting the response so as to self-present more favorably – or because respondents consider some questions so intrusive that they bypass careful processing and quickly present a socially desirable (and perhaps false) response (Schaeffer, [33]). We also can only speculate on the processes underlying respondents' giving embarrassing answers more quickly for categorical and ordinal questions in this data set, though it is worth noting that the response tasks required by a number of the ordinal questions (providing an opinion from 'strongly favor' to 'strongly oppose') are surely quite different from what is required in retrieving instances for behavioral frequency questions. It is likely that multiple different processes are at play in the current data set, and we certainly cannot assume that all responses were fully accurate. (At least some socially desirable responses may have been misreports, and at least some socially undesirable responses might have been shaded so as to be less than fully truthful).
As we see it, testing the generality of the findings in this study will be an important next step: determining whether the findings extend to respondents recruited in different ways (e.g., probability samples) or responding in other modes (e.g., landline telephones, face-to-face, web surveys), and to different kinds of questions (e.g., questions on sensitive behavioral frequency topics like voting, where reporting less of the behavior is seen as less socially desirable, and different kinds of opinion questions with and without ordinal response scales). Our sample was not large enough to allow us to examine whether the patterns were very different for the (minority of) respondents who reported having been mobile or multitasking or among others who could overhear answers than for those who were stationary, task-focused, or alone – although the fact that the response distributions did not differ for these groups (see Schober et al., [35]) suggests the patterns may hold across these different circumstances.
Clearly, it would be a mistake to argue from our findings that speeded or delayed responding straightforwardly demonstrates that a respondent is embarrassed to give a particular answer for a particular question format. Rather, we see the strength and reliability of the patterns observed here as uncovering a set of phenomena that need to be considered in models of response processes, and whose underlying dynamics deserve further qualitative exploration. To understand the processing involved in reporting socially desirable information, more complex models will be needed that take into account the intrusiveness of the question and the social desirability of the response – distinguishing question and response sensitivity – as well as the cognitive difficulty of recalling the relevant information, and the various affective concerns that may motivate respondents to answer in particular ways: their privacy concerns, judgments of risks vs. gains of disclosing embarrassing information, and their personal need to avoid embarrassment.
Acknowledgments
The authors gratefully acknowledge National Science Foundation grants to authors Schober and Conrad that funded collection of the data set on which these analyses focus (SES-1026225 and SES-1025645); use of the services and facilities of the Population Studies Center at the University of Michigan, funded by NICHD Center Grant P2CHD041028; New School for Social Research faculty research funds to author Schober; and advice from William Hirst, Ai Rene Ong, Paul Schulz, Brady West, and the editors and reviewers.
Disclosure statement
No potential conflict of interest was reported by the authors.
Supplementary material
Supplemental data for this article can be accessed at http://dx.doi.org/10.1080/13645579.2020.1824629.
Correction Statement
This article has been republished with minor changes. These changes do not impact the academic content of the article.
Notes
1 Almost all the deviations consisted of omitting the parenthetical text 'including the recent past that you have already told us about' for two questions asking about sexual partners since the respondent's 18th birthday.
2 A transformed set of standardized latency scores was also calculated, following McDaniel and Timm's ([28]) standardization procedure, to control for individual differences in responding. Initial analyses carried out on both raw and standardized latencies produced exactly the same pattern of results, and so only the raw latency scores are reported here.
3 This method of empirically determining question sensitivity goes beyond the less formal method of question selection in Schober et al. ([35]).
4 Note that other criteria could have been included to measure question and response sensitivity, for example, the extent to which raters believed that a particular response could harm someone's reputation, the extent to which a particular response could be legally compromising, the extent to which a respondent might feel shame, etc. Given the burdensome length of the rating task, we selected embarrassment for most people as a plausible proxy that could tap into judgments of group (rather than personal) norms. Whether different criteria (for example, asking about personal norms) would have led to different ratings is, of course, unknown.
5 As detailed in Feuer et al. ([16]), different thresholds led to different classifications of question sensitivity. For example, with a 40% threshold, 4 more questions (cigarette smoking, drinking alcohol, sexual orientation, and television watching) move to being sensitive. Analyses reported here tested our research questions with several different thresholds; the 50% threshold yielded the clearest and most consistent findings, while the 60% threshold classified too few questions as sensitive to allow us to explore our research questions.
6 Using the dependent variable of time until sound rather than time until response, this interaction of response sensitivity and response type is a bit less clear; the interaction is significant in the ordinal model (robustness check) but not significant in the linear model.
References
Andersen, H., & Mayerl, J. (2017). Social desirability and undesirability effects on survey response latencies. Bulletin of Sociological Methodology/Bulletin De Methodologie Sociologique, 135 (1), 68 – 89. https://doi.org/10.1177/0759106317710858
Andersen, H., & Mayerl, J. (2019). Responding to socially desirable and undesirable topics: Different types of response behaviour? Methods, Data, Analyses, 13 (1), 7 – 35. https://doi.org/10.12758/mda.2018.06
Antoun, C., Zhang, C., Conrad, F. G., & Schober, M. F. (2016). Comparisons of online recruitment strategies for convenience samples: Craigslist, Google AdWords, Facebook and Amazon's Mechanical Turk. Field Methods, 28 (3), 231–246. https://doi.org/10.1177/1525822X15603149
Bassili, J. N., & Scott, B. S. (1996). Response latency as a signal to question problems in survey research. The Public Opinion Quarterly, 60 (3), 390 – 399. https://doi.org/10.1086/297760
Belli, R. F., Traugott, M. W., & Beckmann, M. N. (2001). What leads to voting overreports? Contrasts of overreporters to validated voters and admitted nonvoters in the American National Election Studies. Journal of Official Statistics, 17 (4), 479 – 498. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/what-leads-to-voting-overreports-contrasts-of-overreporters-to-validated-voters-and-admitted-nonvoters-in-the-american-national-election-studies.pdf
Boersma, P., & Weenink, D. (2016). Praat: Doing phonetics by computer [Computer program]. Version 6.0.14.
Brenner, P. S., & DeLamater, J. (2016). Lies, damned lies, and survey self-reports? Identity as a cause of measurement bias. Social Psychology Quarterly, 79 (4), 333 – 354. https://doi.org/10.1177/0190272516628298
Brown, N. R. (1995). Estimation strategies and the judgment of event frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21 (6), 1539 – 1553. https://doi.org/10.1037/0278-7393.21.6.1539
Brown, N. R., & Sinclair, R. C. (1999). Estimating number of lifetime sexual partners: Men and women do it differently. Journal of Sex Research, 36 (3), 292 – 297. https://doi.org/10.1080/00224499909551999
Conrad, F. G., Brown, N. R., & Cashman, E. R. (1998). Strategies for estimating behavioural frequency in survey interviews. Memory, 6 (4), 339 – 366. https://doi.org/10.1080/741942603
Conrad, F. G., & Schober, M. F. (2015). Precision and Disclosure in Text and Voice Interviews on Smartphones: 2012 [United States]. https://www.openicpsr.org/openicpsr/project/100113/version/V2/view
Corkrey, R., & Parkinson, L. (2002). A comparison of four computer-based telephone interviewing methods: Getting answers to sensitive questions. Behavior Research Methods, Instruments, & Computers, 34 (3), 354 – 363. https://doi.org/10.3758/BF03195463
Draisma, S., & Dijkstra, W. (2004). Response latency and (para)linguistic expressions as indicators of response errors. In S. Presser, J. M. Rothgeb, M. P. Couper, J. T. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Methods for testing and evaluating survey questionnaires (pp. 131 – 147). John Wiley & Sons, Inc.
Dunn, T. G., Lushene, R. E., & O'Neil, H. F. (1972). Complete automation of the MMPI and a study of its response latencies. Journal of Consulting and Clinical Psychology, 39 (3), 381 – 387. https://doi.org/10.1037/h0033855
Ehlen, P., Schober, M. F., & Conrad, F. G. (2007). Modeling speech disfluency to predict conceptual misalignment in speech survey interfaces. Discourse Processes, 44 (3), 245 – 265. https://doi.org/10.1080/01638530701600839
Feuer, S., Fail, S., & Schober, M. F. (2019). Empirically assessing survey question and response sensitivity [Manuscript in preparation].
Fu, H., Darroch, J., Henshaw, S., & Kolb, E. (1998). Measuring the extent of abortion underreporting in the 1995 National Survey of Family Growth. Family Planning Perspectives, 30 (3), 128 – 133, 138. https://doi.org/10.2307/2991627
Garbarski, D., Schaeffer, N. C., & Dykema, J. (2011). Are interactional behaviors exhibited when the self-reported health question is asked associated with health status? Social Science Research, 40 (4), 1025 – 1036. https://doi.org/10.1016/j.ssresearch.2011.04.002
Garbarski, D., Schaeffer, N. C., & Dykema, J. (2016). Interviewing practices, conversational practices, and rapport: Responsiveness and engagement in the standardized survey interview. Sociological Methodology, 46 (1), 1 – 38. https://doi.org/10.1177/0081175016637890
Hadaway, C. K., Marler, P. L., & Chaves, M. (1993). What the polls don't show: A closer look at U.S. church attendance. American Sociological Review, 58 (6), 741 – 752. https://doi.org/10.2307/2095948
Hanley, C. (1962). The "difficulty" of a personality inventory item. Educational and Psychological Measurement, 22 (3), 577 – 584. https://doi.org/10.1177/001316446202200316
Holden, R. R., Fekken, G. C., & Jackson, D. N. (1985). Structured personality test item characteristics and validity. Journal of Research in Personality, 19 (4), 386 – 394. https://doi.org/10.1016/0092-6566(85)90007-8
Holtgraves, T. (2004). Social desirability and self-reports: Testing models of socially desirable responding. Personality and Social Psychology Bulletin, 30 (2), 161 – 172. https://doi.org/10.1177/0146167203259930
Johnston, M., Ehlen, P., Conrad, F. G., Schober, M. F., Antoun, C., Fail, S., Hupp, A., Vickers, L., Yan, H., & Zhang, C. (2013). Spoken dialog systems for automated survey interviewing. In Proceedings of the 14th Annual SIGDIAL Meeting on Discourse and Dialogue (SIGDIAL 2013) (pp. 329 – 333). http://www.aclweb.org/anthology/W13-4050
Kreuter, F., Presser, S., & Tourangeau, R. (2008). Social desirability bias in CATI, IVR, and web surveys: The effects of mode and question sensitivity. Public Opinion Quarterly, 72 (5), 847 – 865. https://doi.org/10.1093/poq/nfn063
Lind, L. H., Schober, M. F., Conrad, F. G., & Reichert, H. (2013). Why do survey respondents disclose more when computers ask the questions? Public Opinion Quarterly, 77 (4), 888 – 935. https://doi.org/10.1093/poq/nft038
Locander, W., Sudman, S., & Bradburn, N. (1976). An investigation of interview method, threat and response distortion. Journal of the American Statistical Association, 71 (354), 269 – 275. https://doi.org/10.1080/01621459.1976.10480332
McDaniel, M. A., & Timm, H. W. (1990). Lying takes time: Predicting deception in biodata using response latency. Paper presented at the 98th Annual Convention of the American Psychological Association, Boston, MA. http://www.people.vcu.edu/~mamcdani/Publications/McDaniel%20&%20Timm%20(1990)%20Lying%20takes%20time.pdf
Olson, K., & Smyth, J. D. (2015). The effect of CATI questions, respondents, and interviewers on response time. Journal of Survey Statistics and Methodology, 3 (3), 361 – 396. https://doi.org/10.1093/jssam/smv021
Olson, K., Smyth, J. D., & Ganshert, A. (2018). The effects of respondent and question characteristics on respondent answering behaviors in telephone interviews. Journal of Survey Statistics and Methodology, 7 (2), 275 – 308. https://doi.org/10.1093/jssam/smy006
Presser, S., & Stinson, L. (1998). Data collection mode and social desirability bias in self-reported religious attendance. American Sociological Review, 63 (1), 137 – 145. https://doi.org/10.2307/2657486
Schaeffer, N. C. (2000). Asking questions about threatening topics: A selective overview. In A. A. Stone, J. S. Turkkan, C. A. Bachrach, J. B. Jobe, H. S. Kurtzman, & V. S. Cain (Eds.), The science of self-report: Implications for research and practice (pp. 105 – 121). Lawrence Erlbaum Associates.
Schaeffer, N. C., & Maynard, D. W. (1996). From paradigm to prototype and back again: Interactive aspects of cognitive processing in survey interviews. In N. Schwarz & S. Sudman (Eds.), Answering questions: Methodology for determining cognitive and communicative processes in survey interviews (pp. 65 – 88). Jossey-Bass.
Schober, M. F., & Bloom, J. E. (2004). Discourse cues that respondents have misunderstood survey questions. Discourse Processes, 38 (3), 287 – 308. https://doi.org/10.1207/s15326950dp3803_1
Schober, M. F., Conrad, F. G., Antoun, C., Ehlen, P., Fail, S., Hupp, A. L., Zhang, C., Yan, H. Y., & Johnston, M. (2015). Precision and disclosure in text and voice interviews on smartphones. PLOS ONE, 10 (6), e0128337. https://doi.org/10.1371/journal.pone.0128337
Schober, M. F., Conrad, F. G., Dijkstra, W., & Ongena, Y. P. (2012). Disfluencies and gaze aversion in unreliable responses to survey questions. Journal of Official Statistics, 28 (4), 555 – 582. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/disfluencies-and-gaze-aversion-in-unreliable-responses-to-survey-questions.pdf
Smith, V. L., & Clark, H. H. (1993). On the course of answering questions. Journal of Memory and Language, 32 (1), 25 – 38. https://doi.org/10.1006/jmla.1993.1002
Stout, R. (1981). New approaches to the design of computerized interviewing and testing systems. Behavior Research Methods and Instrumentation, 13 (4), 436 – 442. https://doi.org/10.3758/BF03202052
Tourangeau, R., Conrad, F. G., & Couper, M. P. (2013). The science of web surveys. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199747047.001.0001
Tourangeau, R., & Smith, T. W. (1996). Asking sensitive questions: The impact of data collection mode, question format, and question context. Public Opinion Quarterly, 60 (2), 275 – 304. https://doi.org/10.1086/297751
Turner, C. F., Ku, L., Rogers, S. M., Lindberg, L. D., Pleck, J. H., & Sonenstein, F. L. (1998). Adolescent sexual behavior, drug use, and violence: Increased reporting with computer survey technology. Science, 280 (5365), 867 – 873. https://doi.org/10.1126/science.280.5365.867
Yan, T., & Tourangeau, R. (2008). Fast times and easy questions: The effects of age, experience and question complexity on web survey response times. Applied Cognitive Psychology, 22 (1), 51 – 68. https://doi.org/10.1002/acp.1331
By Stefanie Fail, Michael F. Schober, and Frederick G. Conrad
Stefanie Fail is the Global Lead for Conversation Experience Design at Nuance Communications, where she designs conversational voice AI interfaces for some of the largest US and global enterprises. She holds a PhD in Psychology from The New School for Social Research.
Michael F. Schober is Professor of Psychology and Vice Provost for Research at The New School. He studies shared understanding—and misunderstanding—in survey interviews, collaborative music-making, and everyday conversation, and how new communication technologies (e.g., text messaging, video chat, automated speech dialogue systems, social media posts) are affecting interaction dynamics.
Frederick G. Conrad is Research Professor and Director, Program in Survey Methodology, University of Michigan. His research concerns improving survey data quality (e.g., via texting, two-way video, and virtual interviewers) as well as new data sources (e.g., social media posts and sensor measurement) for possible use in social and behavioral research.