Some Reflections on the Subjective Nature of Psychological Measurements

Dec 1, 2023

Last month I finally decided to address my left shoulder problem, which has bothered me for probably a decade or two (I can’t recall when it started, and it feels as if it has always been there). The problem is that I find it quite effortful to externally rotate my left arm. It felt as though I had been born with some abnormal bone structure that prevented me from properly rotating my arm. At least, this was what I considered highly probable before seeing the PT (physical therapist).

After several physical therapy sessions, the root of the problem turned out to be my inability to properly control and position my left shoulder blade, which in the long term probably led to weakness in my rotator cuff. Collectively, these issues led me to believe I had a deformed bone structure when, in reality, I had a more intricate combination of problems related to posture, muscular weakness and compensation, and a deficiency in kinesthesia. It was a strange experience when, under the guidance of the PT, I first tried to activate my left rotator cuff with the shoulder blade in the correct position. It was extremely difficult, but still, I managed to mobilize and externally rotate my arm slowly. It hurt a lot, and after only three repetitions of the exercise, my arm started trembling and couldn’t go further. Nevertheless, it was the first time in my life that I had rotated my arm to such an extent using muscles that my mind had not been aware of. My kinesthetic vector space, spanned by the muscles I can consciously control, was expanded by another dimension.

What if subjective reports were all we had?

During the physical therapy sessions, the PT utilized multiple strategies to assess my condition. For instance, I was told to perform certain movements, some of which were resisted by the PT. These assessment methods are highly standardized, and in some sense resemble those conducted by clinical psychologists. However, there is a stark difference—although both of them involve subjective reports (e.g., I was often asked to report tightness or soreness during the sessions), the PT’s assessment is more “controlled” than many psychological assessments in the sense that there exist some objective bases that allow the assessment to be grounded in the physical world. These objective controls are, for example, the aforementioned assigned movements and the resistance added against them.

Many psychological assessments, on the other hand, consist purely of subjective ratings. How could this affect the assessment? To illustrate, let’s walk through my physical therapy sessions again, but in a mentalist fashion this time. Suppose I went through the therapy sessions just as described above. The difference, however, is that instead of assessing my condition with the standard PT tools, in this hypothetical scenario my subjective ratings of “how well I could control the rotator cuff” were taken at face value, and two such ratings were collected, one before and one after the sessions. As a guy who had never properly activated his rotator cuff for over a decade prior to therapy, my perception of the rotator cuff was extremely blunted. In fact, it turns out that I had erroneously attributed my control of the upper trapezius to the rotator cuff: I falsely believed that I had a well-functioning rotator cuff until going through the sessions. As a consequence, I would have rated my rotator cuff as stronger before therapy than after it, even though several weeks of proper training had in fact made it function better.

In the previous hypothetical scenario, subjective reports are completely misleading. Thinking this through carefully reveals a subtle assumption: the mental ruler has to stay the same (i.e., the respondent has to respond to the scale according to identical standards across measurement occasions) for the subjective reports to be valid within this context. However, physical therapy systematically expanded my experience and thereby modified my mental ruler. The subjective reports before and after therapy can therefore NOT be interpreted against a common scale. I broke the assessment.
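To make the "mental ruler" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the linear `MentalRuler` class, its `zero` and `unit` parameters, and all of the numbers are made up, and real respondents need not rate linearly at all. It only shows that a change in self-ratings tracks the change in true ability when the ruler stays fixed, and can even flip sign when the ruler shifts between occasions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentalRuler:
    """A hypothetical linear 'mental ruler' mapping true ability to a self-rating."""
    zero: float  # ability level the respondent experiences as a rating of 0
    unit: float  # amount of ability the respondent experiences as one rating point

    def rate(self, ability: float) -> float:
        return (ability - self.zero) / self.unit


def observed_change(ability_before, ability_after, ruler_before, ruler_after):
    """Post-treatment self-rating minus pre-treatment self-rating."""
    return ruler_after.rate(ability_after) - ruler_before.rate(ability_before)


# Made-up numbers: true rotator cuff control improves from 2 to 5.
ability_before, ability_after = 2.0, 5.0

# Before therapy: blunted perception, so low ability is rated generously.
blunted = MentalRuler(zero=-4.0, unit=1.0)
# After therapy: a recalibrated ruler.
calibrated = MentalRuler(zero=0.0, unit=1.0)

same_ruler = observed_change(ability_before, ability_after, calibrated, calibrated)
shifted_ruler = observed_change(ability_before, ability_after, blunted, calibrated)

print(f"true change:                  {ability_after - ability_before:+.1f}")
print(f"rated change, constant ruler: {same_ruler:+.1f}")
print(f"rated change, shifted ruler:  {shifted_ruler:+.1f}")
```

With a constant ruler the rated change equals the true change. With the shifted ruler the rated change is negative although the true change is positive, which is the situation in my hypothetical scenario.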

When do psychological scales work well?

After hearing the above story, it might feel as if psychological assessments, which predominantly consist of subjective ratings, are unreliable. Why are they so prevalent then? Now is a good opportunity to look closely at the context and assumptions of these tools.

Psychological scales, and indeed most psychological studies, explain phenomena at the population level. Therefore, every score has to be interpreted with reference to a particular population. Knowing the score of an individual on a scale explains nothing unless we also have access to a population of scores. This is how psychologists ground subjective experiences in reality: by comparing an individual’s subjective rating to those of a population. Therefore, the meaning of these psychological scores changes if the reference population changes¹, and we can only vaguely understand a score by learning about this reference population and the items constituting the scale.

This brings us to the context in which subjective psychological scales are developed and applied. As hinted above, the focus is on the population. Consequently, responses are typically collected ONCE from each individual in the population of interest. That is, the scales are intended to be administered cross-sectionally. A natural corollary is that, since each response² comes from a different individual, any two responses are independent. In other words, knowing the score of one individual provides no information about the score of another. Compare this to the scenario where the scale is applied repeatedly to collect multiple responses from the SAME individual. Now, the responses collected at different time points are not independent: knowing that an individual scored, say, 30 on the scale last week helps tremendously in predicting their score this week. The responses from the same individual at different time points should still be independent, however, when conditioned on the construct the scale is measuring. In simpler terms, once we remove the construct’s contribution to the scores, whatever is left behind should provide no information for predicting any other response. This is the rationale justifying the application of a scale, originally developed in a cross-sectional context, in a longitudinal setting.
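The conditional independence argument can be checked with a small simulation. This is only a sketch under strong assumptions that I am adding for illustration: each score is the construct plus independent Gaussian noise, the construct stays constant across the two occasions, and the function names and parameter values (`simulate`, `construct_sd`, `noise_sd`) are hypothetical. In real data the construct is latent; here we can subtract it only because we generated it ourselves.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def simulate(n_people=5000, construct_sd=5.0, noise_sd=2.0, seed=0):
    """Return (raw, residual) correlations between two measurement occasions."""
    rng = random.Random(seed)
    # Each person's level on the construct, assumed constant across occasions.
    construct = [rng.gauss(30.0, construct_sd) for _ in range(n_people)]
    # Observed score = construct + independent occasion-specific noise.
    week1 = [c + rng.gauss(0.0, noise_sd) for c in construct]
    week2 = [c + rng.gauss(0.0, noise_sd) for c in construct]
    raw = pearson(week1, week2)
    # Remove the construct's contribution and correlate what is left.
    resid1 = [s - c for s, c in zip(week1, construct)]
    resid2 = [s - c for s, c in zip(week2, construct)]
    residual = pearson(resid1, resid2)
    return raw, residual


raw_corr, residual_corr = simulate()
print(f"correlation of raw scores: {raw_corr:.2f}")
print(f"correlation of residuals:  {residual_corr:.2f}")
```

The raw scores from the two weeks are strongly correlated, while the residuals are essentially uncorrelated. If the construct itself (or its meaning to the respondent) changed between occasions, this simple model would no longer describe the data.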

So yeah, it seems safe to apply these scales in longitudinal settings as long as the scales are well-developed such that we can measure some construct, doesn’t it? The problem is that the construct, as understood by the respondents, could change over time. This is particularly likely considering that longitudinal studies often involve some form of “treatment” in which not only the levels of some latent ability but also the construct itself may change systematically. In the case of my PT example, the “treatment” involved postural correction and muscular training, which systematically changed and increased my awareness of a set of forgotten muscles. Therefore, my understanding of “how well can I control my left rotator cuff” differed before and after treatment. Essentially, I was rating my ability to control the left upper trapezius instead of the left rotator cuff prior to treatment.

This issue does not arise in cross-sectional settings where each individual responds to the scale only once. In such settings, even if some individuals understand the scale “incorrectly” (i.e., not as the intended construct the scale is targeting), as long as the bulk of the respondents do so correctly and those who misunderstand do so randomly (e.g., they do not tend to misunderstand in a particular way), the collected scores would still be valid when averaged at the population level. The individuals who misunderstood can be viewed as a source of measurement error, and since the directions of misunderstanding are random, they cancel out at the population level.
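A quick simulation illustrates the difference between random and directional misunderstanding. It is a sketch with assumptions of my own: 20% of respondents misread the scale, their misreading adds a Gaussian offset to the score they would otherwise give, and the function `mean_bias` and all numbers are hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def mean_bias(offset_mean, n_people=20000, share_misunderstanding=0.2, seed=1):
    """Bias of the population mean when some respondents misread the scale.

    offset_mean == 0 models random (directionless) misunderstanding;
    offset_mean != 0 models misunderstanding that leans in one direction.
    """
    rng = random.Random(seed)
    # Scores respondents would give if everyone understood the scale as intended.
    intended = [rng.gauss(50.0, 10.0) for _ in range(n_people)]
    observed = []
    for score in intended:
        if rng.random() < share_misunderstanding:
            score += rng.gauss(offset_mean, 10.0)  # this respondent misreads the scale
        observed.append(score)
    return mean(observed) - mean(intended)


random_bias = mean_bias(offset_mean=0.0)
systematic_bias = mean_bias(offset_mean=10.0)
print(f"bias with random misunderstanding:      {random_bias:+.2f}")
print(f"bias with directional misunderstanding: {systematic_bias:+.2f}")
```

With directionless offsets the population mean barely moves, whereas offsets that lean one way shift it by roughly the share of misunderstanding respondents times the average offset. Random misunderstanding still adds noise to individual scores; it only cancels out in the aggregate.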

When scales are developed in a cross-sectional setting, they are tested and validated at the population level (i.e., across individuals). It is thus generally safe to apply these scales in similar cross-sectional settings. On the other hand, since they have never been validated with longitudinal use in mind, it is unclear whether they are still valid in such settings. In such cases, applying the scales longitudinally introduces subtleties such as the above implicit assumption of “construct invariance”³, where the “mental ruler” of an individual is assumed to stay constant across time. Construct invariance is extremely difficult to ensure for purely subjective measures. Nonetheless, the demand for longitudinal applications is still high. How, then, might we maintain construct invariance?

The case of educational assessment

Physical therapy assessments present a case where construct invariance can be more or less maintained by providing objective bases (i.e., assigned movements) for subjective mental rulers to anchor upon. Nevertheless, physical therapy and psychotherapy might differ too much to allow for a good analogy. So let’s switch gears to a sister field, educational assessment, which also assesses somewhat abstract constructs, much like psychology.

In educational assessment, the goal is to assess abilities such as mathematical skills and language proficiency in individuals. The concepts, theories, and analytic frameworks applied to developing assessment instruments for this goal, such as reliability, validity, classical test theory, and item response theory, are essentially identical to what psychologists use. However, the measurement tools in educational assessment can generally be applied in a longitudinal setting without much threat to their validity. This is because they differ from most psychological measurements in that they completely remove subjectivity from scoring: items in educational assessments are often dichotomously scored (i.e., a response is either correct or incorrect based on some fixed criteria independent of respondents’ subjectivity). Consequently, as long as factors such as motivation, test-retest biases⁴, and the testing environment are controlled for, the interpretation of the assessments is straightforward.
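As a tiny illustration of dichotomous scoring, the sketch below scores responses against a fixed answer key. The item identifiers, the key, and the helper functions are hypothetical; the point is only that the score depends on the response and the key, not on how the respondent feels about their own ability.

```python
# Hypothetical answer key; the criteria are fixed before anyone responds.
ANSWER_KEY = {
    "item_1": 2,   # (5 + 3) / 4
    "item_2": 12,  # 3 * 4
}


def score_item(item_id, response):
    """Return 1 for a correct response and 0 otherwise."""
    return 1 if response == ANSWER_KEY[item_id] else 0


def total_score(responses):
    """Sum of dichotomous item scores for one respondent on one occasion."""
    return sum(score_item(item_id, resp) for item_id, resp in responses.items())


before = {"item_1": 3, "item_2": 12}  # made-up responses before instruction
after = {"item_1": 2, "item_2": 12}   # made-up responses after instruction

print(total_score(before), total_score(after))
```

Because the same key is applied on both occasions, a change in the total score reflects a change in the responses rather than a change in the respondent's mental ruler.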

Addressing subjectivity

I used to be fascinated with psychologists’ ideas of measuring abstract constructs through subjective reports. More recently, however, I feel we have been stumbling over subjectivity for too long. Subjective experiences are by nature vague. Vagueness, unfortunately, is detrimental to the progress of science.

Unless anchored to some criteria, it would be impossible to evaluate subjective experiences. Psychology has traditionally achieved this by anchoring individuals to a population (technically speaking, by introducing probability models). This handy approach is valid until one starts thinking about treatments and repeated measures where the time dimension is incorporated. Subjectivity, a property psychologists cleverly sidestepped by utilizing probabilities and random variations, has resurfaced to mess up our measurements in an era when longitudinal inquiries are most imperative. To deal with subjectivity, we’ve got two options.

The first option, as hinted throughout the article, is to develop new scales that can at least be partially anchored to some objective bases beyond subjective experiences. In the fields of physical therapy and educational assessment, this is readily achieved because they both have concrete theories pertaining to their areas of focus. Physical therapists know how joints, muscles, and nerves function together to produce proper movements. This body of knowledge is applied directly to develop relevant assessments. Similarly, for developers of math assessments, the knowledge that, say, correctly solving the item $\frac{5 + 3}{4} = ?$ requires proficient skills in addition and division licenses the measuring of these skills. Conversely, psychologists know very little about the processes bridging constructs to subjective ratings and maybe even less about the constructs themselves, let alone how to ground subjectivity in objective bases. The crux of the problem, therefore, lies neither in statistics nor in methodology but in the need for better theories. Psychologists lack good theories to improve measurements, and poor measurements, in turn, do nothing to refine theories.

The second option is to explicitly model the threats to construct invariance by, for instance, introducing additional measures (subjective or not). This would require very strong assumptions and, again, a good theory to clarify the potentially intricate interactions among the constructs being measured. These requirements are hard to fulfill in practice, since we resort to this option precisely because we lack a good theory in the first place. Nonetheless, it is logically viable to address such threats to construct invariance. If you’re curious about the assumptions and theories involved, you can take a look at one such naive analysis.

¹ This is also why evaluating and comparing studies in psychology is difficult. After all, different populations are studied, and therefore statistics such as effect size measures cannot be directly compared across studies unless strong assumptions and additional efforts are made to ensure the comparability of the populations.
² Or more precisely, each “set of responses” since there may be multiple items (and thus multiple responses) collected from a scale administered to an individual.
³ I coined this term by borrowing measurement invariance. The two terms are conceptually similar, but measurement invariance focuses on groups (e.g., invariance across gender or cultural groups) whereas “construct invariance” here emphasizes invariance across measurement occasions within the same individual. In addition, measurement invariance is usually discussed in abstract statistical terms. On the other hand, “construct invariance” is introduced to help clarify the underlying process. We are thinking forward to reason about under what conditions (i.e., assumptions) the process generates data that can be analyzed consistently across both longitudinal and cross-sectional settings.
⁴ The memory of items seen in previous assessments is one such example. In educational assessment, this is usually controlled by using parallel forms (different versions of the same assessment or test) in retests.
