stats query

Posted: Sat Jan 13, 2024 10:37 am
by Allo V Psycho
I have a variety of data sets and need to compare them for differences. The histograms show quite a complex relationship: typical example shown. What test would I need to show whether the histogram shown is statistically different from another, more or less similar, one? Grateful for any help from the hive mind.

Re: stats query

Posted: Sat Jan 13, 2024 10:51 am
by bob sterman
The variable on your x-axis of the histogram - is that a continuous variable that has been binned to create 18 categories for the histogram? Or is it an ordinal variable that has 18 categories?

Re: stats query

Posted: Sat Jan 13, 2024 11:34 am
by Allo V Psycho
bob sterman wrote:
Sat Jan 13, 2024 10:51 am
The variable on your x-axis of the histogram - is that a continuous variable that has been binned to create 18 categories for the histogram? Or is it an ordinal variable that has 18 categories?
Ordinal with 18 categories.

Re: stats query

Posted: Sat Jan 13, 2024 1:58 pm
by jimbob
Allo V Psycho wrote:
Sat Jan 13, 2024 11:34 am
bob sterman wrote:
Sat Jan 13, 2024 10:51 am
The variable on your x-axis of the histogram - is that a continuous variable that has been binned to create 18 categories for the histogram? Or is it an ordinal variable that has 18 categories?
Ordinal with 18 categories.
Ouch.

Are there fewer parent categories you could use to combine some into?

Re: stats query

Posted: Sat Jan 13, 2024 2:17 pm
by Allo V Psycho
jimbob wrote:
Sat Jan 13, 2024 1:58 pm
Allo V Psycho wrote:
Sat Jan 13, 2024 11:34 am
bob sterman wrote:
Sat Jan 13, 2024 10:51 am
The variable on your x-axis of the histogram - is that a continuous variable that has been binned to create 18 categories for the histogram? Or is it an ordinal variable that has 18 categories?
Ordinal with 18 categories.
Ouch.

Are there fewer parent categories you could use to combine some into?
Alas, no... all unique.

Re: stats query

Posted: Sat Jan 13, 2024 2:58 pm
by science_fox
How many samples have you got? Could you run a PCA on them? That would group the sets with more features in common together, even if it's a bit complex to work out what makes them common.

Not my area of expertise!

Re: stats query

Posted: Sat Jan 13, 2024 3:13 pm
by bob sterman
To compare 2 of these distributions - how about a 2 x 18 chi-square test?

https://www.icalcu.com/stat/chisqtest.html
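
For anyone who would rather script the 2 x 18 test than use the online calculator, here is a minimal sketch using only the Python standard library. The function names and the example counts are made up for illustration and are not from the thread. Note that a chi-square test treats the 18 categories as unordered, so it ignores the ordinal structure.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    starting from Q(1/2, z) = erfc(sqrt(z)) or Q(1, z) = exp(-z).
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi_square_two_samples(counts_a, counts_b):
    """Chi-square test of homogeneity for two histograms with the same bins.

    Returns (statistic, degrees of freedom, p-value).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both histograms need the same categories")
    n_a, n_b = sum(counts_a), sum(counts_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("each histogram needs at least one observation")
    total = n_a + n_b
    stat, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # category empty in both samples: contributes nothing
        used += 1
        exp_a = n_a * col / total  # expected count if both share one distribution
        exp_b = n_b * col / total
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    df = used - 1
    return stat, df, chi2_sf(stat, df)


if __name__ == "__main__":
    # Hypothetical counts for 18 ordinal categories in two data sets.
    set_1 = [5, 9, 14, 22, 30, 41, 38, 35, 29, 33, 40, 36, 27, 19, 12, 8, 4, 2]
    set_2 = [6, 8, 15, 20, 33, 39, 40, 31, 30, 35, 37, 38, 25, 21, 11, 9, 3, 3]
    stat, df, p = chi_square_two_samples(set_1, set_2)
    print(f"chi-square = {stat:.2f}, df = {df}, p = {p:.3f}")
```

The usual rule of thumb is that the approximation gets shaky when many expected counts fall below about 5, which is one reason combining sparse categories is often suggested.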

Re: stats query

Posted: Sat Jan 13, 2024 8:27 pm
by KAJ
"statistically [significant] differen[ce]" is rarely of real interest. See Wikipedia. What kind of difference (location, dispersion, ...) and what size of difference would you consider worthy of mention?

Re: stats query

Posted: Sun Jan 14, 2024 10:02 am
by Allo V Psycho
Bob:
That seems plausible: Thanks for the handy calculator!
KAJ:
Yes, I think this is where I was heading. The numbers are so big that even quite a small difference might be statistically significant. I generally phrase this as "statistical significance is not the same as biological significance", and would aim to calculate the effect size.
As these data sets emerge (it's taking me quite a long time to analyse them), I think they plainly DO fall into the 'not biologically meaningful' category.
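
One common effect size for a chi-square comparison is Cramér's V, which rescales the statistic by the sample size. The sketch below is only an illustration of that idea, not necessarily the measure used here; the function name and the example tables are made up.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))); 0 means no association.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Same proportions, 100 times the data: chi-square grows with n, V does not.
    small = [[10, 20], [20, 10]]
    large = [[1000, 2000], [2000, 1000]]
    print(cramers_v(small), cramers_v(large))
```

With very large samples the p-value will shrink towards zero for almost any difference, while V stays put, so V is the number to judge against what counts as biologically meaningful.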

Thanks to all responders!

Re: stats query

Posted: Mon Jan 15, 2024 9:11 am
by IvanV
In time series analysis, there are frequently high correlations between unrelated things, because anything with a time trend will have a high correlation with anything else with a time trend. Many completely different things have time trends. This raises a potential issue with these. Maybe the categories are such that many things tend to have similarities across those categories, for no particular reason.

Re: stats query

Posted: Mon Jan 15, 2024 5:41 pm
by sTeamTraen
IvanV wrote:
Mon Jan 15, 2024 9:11 am
In time series analysis, there are frequently high correlations between unrelated things, because anything with a time trend will have a high correlation with anything else with a time trend. Many completely different things have time trends. This raises a potential issue with these. Maybe the categories are such that many things tend to have similarities across those categories, for no particular reason.
Yes, Bob's suggestion of the 18-way chi-square test will not make sense if the histograms being compared are, say, the prevalence of 18 diseases in the UK in 2022 versus 2023. It would be OK for the prevalence of those diseases in the UK in 2022 versus France in 2022, etc.

FWIW, I often use a 10-way chi-square test if I think that the distribution of trailing digits in a dataset looks dodgy (because the numbers have been "proctologically derived"). You can compare the actual distribution against uniform, or if you want to be extra fancy and have good reason to think that Benford's Law is operating, you can compare with the distribution for the Nth digit predicted by that law (which is famous for the exponential decay-type curve for leading digits, but also makes predictions about the 2nd, 3rd, etc.).
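
A rough sketch of that trailing-digit check, assuming the values are integers (for decimals you would first scale them to the reported precision). The function names are invented for illustration. The statistic has 9 degrees of freedom, so it is compared against the tabulated 5% critical value of 16.919.

```python
import math
from collections import Counter

CRITICAL_5PCT_DF9 = 16.919  # upper 5% point of chi-square with 9 df


def digit_chi_square(digits, probs):
    """Goodness-of-fit statistic for observed digits 0-9 against probs."""
    counts = Counter(digits)
    n = len(digits)
    return sum((counts.get(d, 0) - n * probs[d]) ** 2 / (n * probs[d])
               for d in range(10))


def trailing_digit_chi_square(values):
    """Compare last digits of integer values against a uniform distribution."""
    digits = [abs(int(v)) % 10 for v in values]
    return digit_chi_square(digits, [0.1] * 10)


def benford_second_digit_probs():
    """Benford's Law probabilities for the second significant digit."""
    return [sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
            for d in range(10)]


if __name__ == "__main__":
    # Hypothetical data: too many values ending in 0 or 5.
    values = [120, 135, 140, 150, 155, 160, 165, 170, 175, 180,
              185, 190, 195, 200, 123, 147, 168, 172, 181, 199]
    stat = trailing_digit_chi_square(values)
    print(f"chi-square = {stat:.2f}, suspicious: {stat > CRITICAL_5PCT_DF9}")
```

To test second digits against Benford instead of uniform, pass the extracted digits and `benford_second_digit_probs()` to `digit_chi_square`.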