Statistical Methodology and Single-Subject Limitations
Language note: Percentages throughout this series describe DNA sequence similarity to reference populations, not fixed ancestry fractions. Following Kampourakis & Peterson (Genetics, 2023), the term “admixture” has been avoided. All calculator outputs are statistical similarity scores relative to defined reference groups.
Kampourakis K & Peterson EL. Genetics 223(3), iyad002 (2023) · Fortes-Lima CA et al. Am J Hum Genet. 2025;112(2):261–275 · David LT. Am Anthropologist. 2024;126(1):153–157 · Sibomana O. Pharmgenomics Pers Med. 2024;17:487–496
This addendum addresses the statistical dimensions of the research methodology — specifically the approaches that serve as practical proxies for formal statistical analysis, and the limitations that apply to any single-subject DNA similarity study. It is offered for readers with quantitative research backgrounds who may wish to evaluate the rigour of the methodology more formally.
Section 1: The Single-Subject Limitation
What it means
This research is based on the DNA similarity profile of one individual. It is not a population study, a clinical trial, or a comparative analysis across multiple subjects. Every finding documented in this series — the West African signal at 55.04%, the Fulani Oracle result at distance 14.18, the North African component at 23.98% — applies to one genome and cannot be generalised to any broader population.
This is an explicit constraint of citizen science genealogy, and it is stated here directly rather than buried in qualifications. A graduate committee reviewing this work would identify it immediately. It does not invalidate the findings — it defines their scope.
The appropriate frame for this research is descriptive, not inferential. The findings describe one person's DNA similarity profile with consistency and rigour. They do not support population-level claims, and none are made.
Why it does not undermine the methodology
Single-subject analysis is well established in fields where individual variation is the object of study rather than a confound to be controlled. Genealogical DNA research is precisely such a field. The question being asked — what does this individual's genome show similarity to, and what does that mean for their ancestral heritage — is answerable at the individual level. Population statistics are not required to answer it.
What is required is consistency across independent instruments — which is precisely what the four-calculator methodology was designed to produce.
Section 2: Calculator Agreement as a Proxy for Statistical Confidence
The convergence test
In the absence of formal statistical significance testing, this research employed a convergence test as a proxy for confidence: signals appearing consistently across all four independent calculators were treated as more reliable than signals appearing in fewer. This approach is methodologically defensible for the following reasons:
- Independence of instruments. The four calculators — EthioHelix K10, Dodecad Africa9, puntDNAL K8, and MDLP K23b — use different reference panels, different algorithmic approaches, and were developed independently. Consistent signals across all four are unlikely to reflect a shared systematic error.
- Direction of agreement. Where calculators disagreed on magnitude, they consistently agreed on direction — all four identifying West African similarity as the dominant signal, all four identifying a North African component. Directional agreement across independent instruments is a meaningful form of validation.
- Oracle distance as a quantitative measure. The Oracle tool produces a numerical distance score — lower scores indicating closer similarity to a reference population. The Fulani result at distance 14.18 (EthioHelix) and 9.48 (puntDNAL), more than twice as close as the next nearest population on both calculators, provides a quantitative basis for the finding that is not dependent on qualitative interpretation alone.
A formal statistical treatment would express calculator agreement as a correlation coefficient or concordance measure across the four instruments. The convergence test employed here is a practical proxy for that measure: it is less precise, but it addresses the same question of whether the instruments agree, and it is appropriate for citizen science methodology.
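As an illustration of what such a concordance measure could look like, the sketch below computes Kendall's coefficient of concordance (W) in plain Python. It is one possible choice of measure, not the method used in this research. It assumes each calculator's components have first been mapped onto a shared set of regional labels, which the calculators do not provide out of the box. The region labels and all scores are placeholders, not this subject's per-calculator results.

```python
from statistics import mean

def ranks(scores):
    """Rank scores from 1 (lowest) to n (highest); assumes no tied values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def kendalls_w(table):
    """Kendall's W for m instruments scoring the same n items (no tie correction)."""
    m, n = len(table), len(table[0])
    rank_rows = [ranks(row) for row in table]
    # Sum of ranks each item received across all instruments.
    totals = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_total = mean(totals)
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# PLACEHOLDER data: one row per calculator, one column per harmonised region.
# The labels and numbers are hypothetical and only illustrate the calculation.
regions = ["West African", "North African", "East African", "Other"]
placeholder_scores = [
    [50.0, 30.0, 15.0, 5.0],
    [60.0, 25.0, 10.0, 5.0],
    [45.0, 35.0, 12.0, 8.0],
    [55.0, 18.0, 20.0, 7.0],  # this row swaps the order of two regions
]

print(f"Kendall's W = {kendalls_w(placeholder_scores):.3f}")
```

W ranges from 0 (no agreement in how the instruments rank the regions) to 1 (identical rankings). Because it works on ranks, it captures agreement on ordering rather than on magnitude, which matches the directional emphasis of the convergence test.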
What formal statistics would add
A graduate-level statistical methodology section would add the following, none of which is expected to alter the findings but all of which would strengthen their formal defensibility:
- Inter-rater reliability coefficients measuring the degree of agreement across the four calculators on each regional signal.
- Confidence intervals around the Oracle distance scores, accounting for calculator noise and reference panel variance.
- Sensitivity analysis testing how much the findings change when each calculator is removed from the convergence test — i.e., whether the Fulani result holds when only three of the four calculators are considered.
- Reference panel documentation formally characterising the composition of each calculator's reference panel and the known gaps in African population coverage.
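The sensitivity analysis in the list above can be sketched as a leave-one-out check: drop each calculator in turn and see whether the remaining ones still agree on the dominant region. This is a minimal sketch under stated assumptions, not an analysis that was run for this research. The calculator keys, region labels, and scores are all placeholders, and the function names are hypothetical.

```python
def top_region(profile):
    """Return the region label with the highest similarity score."""
    return max(profile, key=profile.get)

def leave_one_out(results):
    """For each calculator left out, report the region the remaining
    calculators unanimously rank first, or None if they disagree."""
    outcome = {}
    for left_out in results:
        remaining = {name: prof for name, prof in results.items() if name != left_out}
        tops = {top_region(prof) for prof in remaining.values()}
        outcome[left_out] = tops.pop() if len(tops) == 1 else None
    return outcome

# PLACEHOLDER data: hypothetical calculators and scores, harmonised to shared labels.
placeholder_results = {
    "calculator_1": {"West African": 50.0, "North African": 30.0, "Other": 20.0},
    "calculator_2": {"West African": 60.0, "North African": 25.0, "Other": 15.0},
    "calculator_3": {"West African": 45.0, "North African": 35.0, "Other": 20.0},
    "calculator_4": {"West African": 55.0, "North African": 20.0, "Other": 25.0},
}

for left_out, agreed in leave_one_out(placeholder_results).items():
    print(f"without {left_out}: remaining calculators agree on {agreed}")
```

A finding that survives every leave-one-out round does not depend on any single instrument. A `None` for some round would flag the calculator whose removal breaks the agreement.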
These additions are not expected to change the conclusions. The West African signal at ~55% and the North African component at ~24% appear across all four instruments, the Fulani Oracle result is the closest match on both calculators for which distances are reported here (EthioHelix and puntDNAL), and all three findings are consistent with the population genomics literature (Fortes-Lima et al., 2025). Formal statistics would quantify the confidence around those findings rather than replace them.
Section 3: The Oracle Distance Measure
What it measures
The GEDmatch Oracle tool computes a distance score between an individual's DNA similarity profile and the profiles of reference populations in the calculator's panel. A lower score indicates closer similarity. It is a mathematical distance measure — not a probability, not a percentage, and not a statement of direct descent.
Interpreted correctly, a Fulani Oracle distance of 14.18 means that of all reference populations in the EthioHelix panel, the Fulani profile is the closest mathematical match to this individual's similarity profile. It does not mean the individual is Fulani, has Fulani ancestors, or is more Fulani than anything else. It means the Fulani reference population, as defined by the EthioHelix panel, is the nearest neighbour in the similarity space the calculator constructs.
The nearest-neighbour interpretation is the correct one. Oracle results identify the closest reference population match — they do not identify ancestry in the genealogical sense. This distinction is consistent with the Kampourakis and Peterson framework applied throughout this series.
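To make the nearest-neighbour reading concrete, the sketch below ranks reference populations by distance from an individual's component profile. It assumes the distance is Euclidean over the component percentages. That is a common choice for Oracle-style tools but is an assumption here, since this addendum only describes the score as a mathematical distance. The component names, population names, and values are hypothetical.

```python
import math

def distance(profile, reference):
    """Euclidean distance between two profiles that share the same component keys."""
    return math.sqrt(sum((profile[k] - reference[k]) ** 2 for k in profile))

def nearest_populations(profile, panel):
    """Return (distance, population) pairs sorted from closest to farthest."""
    return sorted((distance(profile, ref), name) for name, ref in panel.items())

# PLACEHOLDER data: a two-component profile and a two-population panel.
individual = {"component_1": 60.0, "component_2": 40.0}
panel = {
    "Reference A": {"component_1": 57.0, "component_2": 44.0},
    "Reference B": {"component_1": 30.0, "component_2": 70.0},
}

for dist, name in nearest_populations(individual, panel):
    print(f"{name}: distance {dist:.2f}")
```

The first entry in the sorted list is the nearest neighbour. Nothing in the calculation refers to descent: the result depends entirely on which populations are in the panel and how their average profiles were defined.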
Why the Fulani result is nonetheless notable
The weight of the Fulani Oracle result lies not in its absolute value but in its relative distance from the next nearest populations. On EthioHelix, the Fulani result at 14.18 is more than twice as close as the next nearest population. On puntDNAL, the Fulani distance of 9.48 shows the same pattern. This relative gap, consistent across two independent calculators, is the basis for treating the Fulani as the closest population match. No single number in isolation would support that reading.
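The relative gap described here can be expressed as a simple ratio: the second-nearest distance divided by the nearest. "More than twice as close" corresponds to a ratio above 2. The sketch below is illustrative only. The function names and the default threshold of 2 are assumptions drawn from that phrasing, and the second-nearest distance used in the example is hypothetical because this addendum does not report it.

```python
def gap_ratio(distances):
    """Ratio of the second-smallest to the smallest distance in a list."""
    if len(distances) < 2:
        raise ValueError("need at least two distances to compare")
    ordered = sorted(distances)
    return ordered[1] / ordered[0]

def is_clear_nearest(distances, threshold=2.0):
    """True when the nearest population is more than `threshold` times closer
    than the runner-up."""
    return gap_ratio(distances) > threshold

# 14.18 is the EthioHelix Fulani distance quoted in the text.
# 30.0 is a HYPOTHETICAL runner-up distance used only for illustration.
example = [14.18, 30.0]
print(f"gap ratio: {gap_ratio(example):.2f}, clear nearest: {is_clear_nearest(example)}")
```

For a nearest distance of 14.18, the runner-up must exceed 28.36 for the ratio to pass 2. For 9.48, the runner-up must exceed 18.96.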
The Fortes-Lima et al. (2025) paper on Fulani population genomics provides the scientific context that makes this result interpretable: the Fulani are a Sahelian pastoralist population carrying documented West African, East African, and ancient North African genetic components, consistent with the broader multi-regional signal documented throughout this research.
Fortes-Lima CA et al. “The Genetic History of the Fulani.” Am J Hum Genet. 2025;112(2):261–275. doi:10.1016/j.ajhg.2024.12.015
Section 4: Appropriate Claims and Their Limits
The following summarises what this research can and cannot claim, stated precisely:
| The research can claim | The research cannot claim |
|---|---|
| This individual's DNA shows the highest similarity to West African reference populations across all four calculators (~55%) | That 55% of this individual's ancestors came from West Africa |
| The Fulani reference population is the closest Oracle match on two independent calculators, at less than half the distance of the next nearest population | That this individual has Fulani ancestors or is of Fulani descent |
| A North African component (~24%) appears consistently across all four calculators | That this component represents a specific number of North African ancestors |
| The multi-regional signal is consistent with documented Fulani population genomics (Fortes-Lima et al., 2025) | That the Fortes-Lima findings apply specifically to this individual's lineage |
| The convergence of four independent calculators increases confidence in the directional findings | That formal statistical significance has been established |
These are the appropriate boundaries for citizen science DNA similarity research. They are stated here not as apologies but as precision — the difference between what the data shows and what it proves is a distinction worth maintaining.
This addendum was prepared in response to questions about the statistical dimensions of the methodology. The research documented in this series is citizen science, conducted with rigour and transparency. Readers wishing to apply formal statistical methods to similar data are encouraged to do so — the methodology guide provides the framework from which such an analysis could begin.
Kampourakis K & Peterson EL. Genetics 223(3), iyad002 (2023) ·
Fortes-Lima CA et al. Am J Hum Genet. 2025;112(2):261–275 ·
Sibomana O. Pharmgenomics Pers Med. 2024;17:487–496 ·
GEDmatch Kit NJ7476284