Scientific Method, Conditional Probability, and p-value

In the last episode, I discussed the basics of probability. To recap, probability is a measurement of a characteristic of a population, not of any individual member of the population. If we have a bag of marbles, half of which are red and the other half blue, and every marble has the same chance of being drawn, then when we draw one out we have a 50% chance of getting a red marble. That is a measurement of the bag. It does not tell us the color of any particular marble that comes out of the bag. A marble is either red or blue; it cannot be 50% of each.

When we take a marble out of the bag, even though we cannot draw a conclusion about its color, we can form a belief about it. This is called the Bayesian view of probability. We call it a belief instead of a conclusion because we cannot guarantee that what we say about the marble is correct every time. Consider the following experiment: I draw a marble out of the bag, guess its color, put it back, and repeat. If I always say the marble is red, I will be right 50% of the time. Similarly, if I always say the marble is blue, I will also be right 50% of the time. So I can say that I have 50% confidence that the marble is red and 50% confidence that it is blue, even though I cannot say for sure what color any particular marble is. The way to verify my confidence is to conduct the experiment above, drawing marbles from the bag again and again, and, once we have collected enough data, to check whether the ratio of the two colors converges to 50-50. Note that we have to redraw a marble every time, so that we are not examining the color of any specific marble but the color of every marble drawn from the bag. Once again, probability is a characteristic of the bag, or population, not of any given marble. If we keep examining the same marble, we will always get red if that marble is red; if our guess was blue, we would be wrong every time instead of half the time. Our 50-50 confidence is therefore not about any specific marble, but about the fact that if we draw marbles repeatedly from the bag, we will be right 50% of the time whether we guess red or blue.
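
As a quick sanity check, here is a minimal simulation sketch of that experiment. The bag size, the fixed guesses, and the number of draws are my own illustrative choices, not from the episode:

```python
import random

# A hypothetical bag: half red, half blue marbles.
bag = ["red"] * 50 + ["blue"] * 50

def guess_accuracy(guess: str, n_draws: int = 100_000) -> float:
    """Draw with replacement, always guess the same color,
    and return the fraction of correct guesses."""
    correct = 0
    for _ in range(n_draws):
        marble = random.choice(bag)  # draw one marble, then "put it back"
        if marble == guess:
            correct += 1
    return correct / n_draws

print(guess_accuracy("red"))   # converges to roughly 0.5
print(guess_accuracy("blue"))  # also converges to roughly 0.5
```

The accuracy of either fixed guess settles around 50%, which is exactly the confidence we claimed about the bag, not about any single marble.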

Now consider a new bag of marbles with three colors instead of two, say red, green, and blue, in equal proportions. If I draw one out, by similar reasoning, we can say that we have 1/3 confidence that it is red, 1/3 confidence that it is green, and 1/3 confidence that it is blue. What if, after the draw, I first check the marble for you and tell you that it is not green? What should your confidence be? The marble can now only be red or blue, and the two have an equal chance of being drawn from the bag, so our confidence should be 50% red and 50% blue. This is all well and good as reasoning. But in the spirit of the scientific method, we also need to verify that it is correct. How can we design an experiment to check our confidence? We can repeat the experiment above with some modifications. We take a random marble from the bag; if it is green, we put it back and do not count that sample. If it is not green, we record whether it is red or not red, and then put it back. We repeat this process until we have a large enough sample, then compute the ratio between the number of red and not-red (in this case, blue) marbles we counted to see if it is close to a 50-50 split. This is the formal way to verify our conditional confidence. We can call it post-sampling filtering, because it still takes samples from the whole population, just as when verifying a regular probability, and filters the samples afterward by removing those that do not satisfy the condition; in this case, the green ones. Besides post-sampling filtering, there is another common process that can also give us the correct answer sometimes, if used properly: the pre-sampling filter.
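
Here is a minimal sketch of that post-sampling filter, again with made-up bag contents and draw counts:

```python
import random

# A hypothetical bag with equal numbers of red, green, and blue marbles.
bag = ["red"] * 30 + ["green"] * 30 + ["blue"] * 30

red, not_green = 0, 0
for _ in range(100_000):
    marble = random.choice(bag)  # draw with replacement
    if marble == "green":
        continue                 # condition not met: discard this sample
    not_green += 1
    if marble == "red":
        red += 1

# Empirical P(red | not green); converges to roughly 0.5.
print(red / not_green)
```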

The pre-sampling filter, as the name suggests, filters before we take samples. We can first remove all of the green marbles from the bag, then start taking random samples and calculating the ratio between red and non-red samples. In this case, we know that we will never get a green marble, since we have already removed them from the bag, so the ratio between red and not-red samples is the conditional probability of drawing a red marble from the original bag, given that we know it is not green. I want to note that the pre-sampling filter changes the population. After removing all of the green marbles, we are no longer sampling from the same bag, and therefore we are not answering the same question. The original question was about a random marble taken from a bag of mixed green, red, and blue marbles; the bag we are sampling from now has only red and blue marbles. The pre-sampling filter has changed what we are measuring. It just happens to give the same answer to the original question, provided that removing the green marbles does not change the relative chances of the red and blue marbles being drawn.
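
For comparison, here is the pre-sampling version of the same sketch. Under the assumption that removing the green marbles does not disturb the red/blue balance, it converges to the same 50%, but note that it is measured on a different bag:

```python
import random

bag = ["red"] * 30 + ["green"] * 30 + ["blue"] * 30

# Pre-sampling filter: physically remove the green marbles first.
filtered_bag = [m for m in bag if m != "green"]

draws = [random.choice(filtered_bag) for _ in range(100_000)]
print(draws.count("red") / len(draws))  # also roughly 0.5, from the modified bag
```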

In the marble example, it may seem blindingly obvious that removing the green marbles will not affect the relative chances of the red and blue marbles being drawn. After all, how could removing the green marbles make red marbles more or less likely to be drawn than blue ones? But even though the assumption is intuitive, it is generally bad scientific methodology to make assumptions without evidence or data. For instance, some red marbles could be removed by accident if the person performing the task has red-green colorblindness. And there are far less obvious ways for the pre-sampling filter to go wrong when measuring a conditional probability.

Consider three products on the market that serve similar functions. They may be phones, computers, or paper towels, but they come in different qualities and prices. You want to conduct an experiment to measure the ratio between people who buy the lowest-tier product and people who buy the top-tier product. Because you don't really care about the mid-tier product, you only ask your participants to choose between the lowest and top tiers. It may seem sensible to include only the things you care about. But in fact, removing the mid-tier product from the choices changes what people buy. This is what market psychologists refer to as the decoy effect. Dan Ariely used the following real-world example in his book Predictably Irrational:

The Economist magazine offered three types of subscriptions: the cheapest option was 59 dollars a year and provided online access only. The middle option was 129 dollars a year and provided print only. The top-tier option provided both print and online access for 129 dollars; yes, the same price as the middle tier, but with online access thrown in. It may seem that no one would ever choose the middle tier, since it costs the same as the top tier but offers fewer benefits. And you would be right. The middle option was not there for people to buy, but to show what a good deal the top-tier option is, and therefore to nudge people toward purchasing it. Indeed, Ariely conducted his own experiment with his students: when presented with only the low-tier and top-tier options, students leaned toward the lower tier, perhaps because of their limited budgets, but once the middle tier was added to the mix, more people chose the top tier than when there was no middle-tier option.

So while it may seem sensible to remove options we don't care about before we sample, in this case, removing the mid-tier option actually changed the probability distribution over the other options and distorted our answer. Instead, we should perform a post-sampling filter: sample with all three options present first, and only afterward remove the samples we do not care about.

Here is another hypothetical example of sampling bias introduced by a pre-sampling filter. Say we want to measure low-income families' access to broadband internet. To keep high-income families from being sampled, we conduct our surveys in low-income neighborhoods around Los Angeles. This filter does exclude the wealthy, but it also introduces sampling bias, because not all low-income families, especially those in rural regions of the U.S., have the same internet access as low-income L.A. neighborhoods.

I hope you can see now that not all samples are created equal. Probability is about more than just data; it is also about how we collect the data. When it comes to conditional probability, the post-sampling filter is the true measurement of the empirical conditional probability, even though it means collecting more data than is sometimes needed. The pre-sampling filter assumes that the filter we apply does not distort the underlying probability, which is often hard to guarantee. Instead of just assuming it, we need to provide data from both the pre-sampling and post-sampling filters to show that they produce the same results.


To recap, conditional probability is the ratio between the number of positive samples and the total number of samples, among the samples that satisfy some condition. In the marble example, the positive samples are the red marbles drawn from the bag, and the condition is that the sample is not green. So the total number of samples satisfying the condition, i.e., not green, is the number of red or blue marbles drawn.
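
Written out as a formula (standard notation, my addition rather than something from the episode), the empirical conditional probability is

```latex
P(\text{red} \mid \text{not green})
  \;=\; \frac{\#\{\text{samples that are red}\}}{\#\{\text{samples that are not green}\}}
  \;=\; \frac{P(\text{red} \cap \text{not green})}{P(\text{not green})}
```

which is exactly what the post-sampling filter estimates by counting.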

Now that I have discussed conditional probability, I will introduce null-hypothesis testing and the p-value, which are the bread and butter of so-called data-driven science. I want to remind you once again that probability is a characteristic of a population, not of any single member of the population. For example, the probability of drawing a red ball from a bag is the ratio between the number of red balls drawn and the total number of balls drawn in a repeated sampling process: take a ball out, check whether it is red, put it back, and repeat. It does not, however, tell us whether a given ball from the bag is red, unless the probability is 100% or 0%, in which case we don't really need probability at all. I am emphasizing this again because it will be crucial when we discuss the probability of a hypothesis being true later.

A hypothesis is simply a proposition about a group whose truth value we want to test. For example, "all swans are white" is a hypothesis. In the falsifiability principle episode, I discussed at length how testing hypotheses is one of the pillars of the scientific method as proposed by Karl Popper. One of the keys to differentiating science from pseudoscience is that scientists try to prove their hypotheses false, while many pseudosciences try to prove their hypotheses true. It may seem strange to try to tear down your own hypothesis, but there is a good reason: science studies universal laws, so that we can use those laws to predict and manipulate the future. Consider the hypothesis that all swans are white. There is no way for us to prove, logically, that this statement is true, as that would require examining all swans past, present, and future and making sure they are all white. But we can easily prove that it is false, simply by finding a non-white swan. It doesn't have to be black; any color other than white will do. So instead of trying to find all the white swans in the world to confirm our hypothesis, the scientific method asks us to try our hardest to find swans that are not white, and only upon failing to find one may we consider, not prove, our hypothesis to be true. In contrast, many pseudoscientific claims simply find some data that supports their hypothesis, declare that the positive data has "proven" it, and ignore the negative data.

The hypothesis is the proposition whose truth value we want to test, but what is the null hypothesis? There is a widespread misconception that the null hypothesis is the negation of the hypothesis. For example, if I hypothesize that a drug will improve test scores, the null hypothesis would supposedly be that it does not improve test scores. That may sound perfectly sensible in English, but logically it does not hold water. The negation of a statement is one that cannot be true at the same time as the original, and that must be true whenever the original is false. For instance, the negation of "all swans are white" is "some swans are not white," not "all swans are not white," because if some swans are grey and some are white, then "all swans are white" and "all swans are not white" are both false. In our case, is my hypothesis that the drug will always improve test scores, or only sometimes? The negations of the two are not the same. If my hypothesis is that the drug always improves test scores, its negation is that the drug sometimes fails to improve, or even decreases, test scores. If my hypothesis is that the drug sometimes improves test scores, its negation is that the drug never improves test scores. In reality, the null hypothesis is not a negation of the hypothesis at all, but the hypothesis that two groups of samples come from the same population.
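
In standard first-order logic notation (my addition, not from the episode), the negation of a universal statement is an existential one:

```latex
\neg\,\bigl(\forall x\; \text{White}(x)\bigr) \;\equiv\; \exists x\; \neg\text{White}(x)
```

That is, "not all swans are white" is equivalent to "there exists a swan that is not white," which is a much weaker claim than "all swans are not white."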

For instance, if you have two groups of test takers taking the same test, giving you two sets of test scores, one from each group, the null hypothesis is that the two sets of scores come from the same population. You may have noticed that the null hypothesis has nothing to do with the drug at all! Even if we can show that the two groups of samples are not from the same population, we cannot be sure that the drug caused the difference. Correlation is not causation. For example, suppose you are looking into a marketed drug that supposedly improves test scores, so you send out a survey to the students who bought the drug, ask about their scores, compare them to the average score of the students who didn't take the drug, and find that the students who bought the drug have a higher average score. Assuming this evidence were enough to conclude that the null hypothesis is false, which I will show later it is not, all it would tell us is that the students who take the drug are a different population from the students who don't. It does not tell us that the drug actually improves scores. There could be other reasons why those two groups differ. For instance, maybe the group who bought the drug is significantly wealthier, since they can afford it, and therefore has access to better education. A controlled experiment can eliminate some of these confounding factors. But to make those experiments satisfy the criteria of the scientific method, observability, falsifiability, and reproducibility, we will have to reconsider them in a different framework from null-hypothesis testing and the p-value. I will get to that later. Let's talk about the p-value first.

The p-value is the conditional probability of the data given that the null hypothesis is true. A couple of things. First, the p-value is a conditional probability, so to measure it we would need the post-sampling filter I discussed above; I will get to measuring the p-value in a bit. Second, because we want to show the null hypothesis to be false (even though, as I noted above, showing the null hypothesis false doesn't show our original hypothesis true; let's set that aside for now), we want the p-value to be small. Intuitively, a small p-value means the observed data is unlikely if the null hypothesis is true, so we can reject the null hypothesis. I said intuitively, not logically, because this reasoning is not mathematically sound. It may sound like the falsifiability principle of the scientific method, but there is a key difference: impossible events never happen, but unlikely events happen all the time. If I hypothesize that all swans are white, finding a black swan is enough to put a nail in the coffin of my hypothesis. But if I hypothesize that 99% of swans are white, finding a black swan doesn't say much about the validity of my hypothesis. It is hard to win the lottery, yet people around the world win lotteries every day; not the same person every time, just some people each day. I will elaborate on how mistakenly treating a small p-value as proof that the null hypothesis is false can lead to problems later, when I talk about p-hacking. But first, let's see how we might measure the p-value.
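
In the usual textbook notation (my paraphrase, not a formula from the episode), the definition reads

```latex
p\text{-value} \;=\; P\bigl(\text{observing data at least this extreme} \,\big|\, H_0 \text{ is true}\bigr)
```

where H_0 denotes the null hypothesis. Keep in mind that the conditioning goes from the null hypothesis to the data, not the other way around.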

So how can we measure the p-value? Well, we would have to use the post-sampling filter mentioned above: collect samples from the whole population, then filter out the ones for which the null hypothesis is false. You may have noticed a problem here. If our hypothesis is about a universal law of nature, the null hypothesis is either true or false; it is not possible for some samples to be taken while the null hypothesis is true and others while it is false. Just to show the absurdity of the logic, say our null hypothesis is that gravity does not exist on Earth. To measure the p-value, we would need to throw objects while standing on the ground and count how many times an object flies off into space versus how many times it lands on the ground, given that there is no gravity on Earth. That is of course not achievable: either there is gravity on Earth or there is not. We cannot take samples for both cases. Similarly, if the null hypothesis is true, all of our samples are taken under the law described by the null hypothesis, and if it is false, all of our samples are taken under some alternative law. There is no way to take samples under two alternative physical laws, as that contradicts our fundamental scientific assumption that the universe is bound by a set of fixed and consistent laws that do not change with space or time. Therefore, we can never really measure the p-value. We can only calculate it, under a collection of assumptions.

The computation of the p-value is pretty straightforward, but I won't go into the equations, as that is not the point here. Instead, I want to dive deeper into the assumptions we make while doing the computation. As I have emphasized many times before, when making scientific inquiries, assumptions without justification and verification can introduce biases into our reasoning and, as a consequence, taint our conclusions. More dangerously, we often embed our own assumptions about the world into our arguments without knowing it, which makes us think we are being objective while in fact we are arguing from subjective assumptions that often have no scientific or logical standing. Therefore, it is important to recognize the assumptions we make when we reason, and instead of taking them as given, find ways to back them up with evidence and the scientific method.

OK, the computation of the p-value. Let's consider our smarty drug again. We have two groups of people, one that takes the drug and one that doesn't. Both take the same test, and each group has its own score distribution. Our null hypothesis is that the two groups of scores are drawn from the same distribution, so we can use the group that did not take the drug as the baseline distribution and calculate the chance of getting the set of scores of those who took the drug, if those scores were sampled from the baseline distribution. But there is a problem: any specific set of scores is statistically improbable for any large enough sample size. For instance, if I want to test whether a coin is fair and I flip it 6 times and get HTHTHT, the coin looks fair, yet the chance of getting that exact sequence is only 1 in 64, not very likely. Therefore, instead of looking at the likelihood of the specific sequence of data, the p-value looks at a summary statistic, which is, as the name suggests, a summary of the data: an average score, an average head count, and so on. So here is the first assumption we are making: that the summary statistic is sufficient to represent the important information in the data. For example, if we choose the average score as our summary statistic, we overlook other information such as the variance. The drug might raise the average score by 5 percent, but it might also make some people fail the exam because it gives them massive headaches. The second assumption is that the sampled data is enough to capture the underlying distribution. For instance, if I want to compare whether two coins are identical, and I throw the first coin 5 times and get 5 heads, then using it as a baseline I will construct an underlying distribution that always gives heads, even though the coin may be perfectly fair and I simply got unlucky with my throws. You might think a large sample size would alleviate the problem of unlucky throws, because it is far more likely to throw 5 heads in a row than 100 with a fair coin, and that is a correct assessment. But a large sample size introduces sensitivity issues. We would not be surprised if the average exam score of 5 people were 10 percent higher than the population average. But if we take 5,000 people's exam scores drawn randomly from the population, we expect them to be very close to the population average, so even a 1 percent difference may be considered unlikely, or statistically significant. Very small errors in sampling, computation, or the approximations in our assumptions can then produce statistically significant results. Therefore, we should not only look at how unlikely the summary statistic is, but also at how different the two data sets actually are, using additional metrics such as effect sizes.
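
To make the procedure concrete, here is a minimal sketch of one common way to calculate such a p-value: a permutation test on the difference in average scores, reported alongside a simple effect size. The score arrays and group sizes are invented for illustration, and this is only one of several standard calculations, not the canonical one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test scores, for illustration only.
drug_group = np.array([78, 85, 91, 74, 88, 82, 95, 79, 84, 90], dtype=float)
placebo_group = np.array([72, 80, 75, 68, 83, 77, 81, 70, 76, 79], dtype=float)

# Summary statistic: difference in average scores.
observed_diff = drug_group.mean() - placebo_group.mean()

# Permutation test: if the null hypothesis is true and both groups come from
# the same population, the group labels are interchangeable, so we shuffle the
# labels many times and see how often a difference this large appears by chance.
pooled = np.concatenate([drug_group, placebo_group])
n_drug = len(drug_group)
n_perm = 100_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:n_drug].mean() - pooled[n_drug:].mean()
    if abs(diff) >= abs(observed_diff):
        count += 1
p_value = count / n_perm

# A simple effect size (Cohen's d) to report alongside the p-value.
pooled_std = np.sqrt((drug_group.var(ddof=1) + placebo_group.var(ddof=1)) / 2)
cohens_d = observed_diff / pooled_std

print(f"observed difference: {observed_diff:.2f}")
print(f"p-value: {p_value:.4f}, Cohen's d: {cohens_d:.2f}")
```

Note how every step bakes in the assumptions above: the choice of the difference in means as the summary statistic, and the use of the pooled sample as a stand-in for the underlying distribution.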

You now know that the p-value is how likely a summary statistic, such as the average scores of two groups of test takers, would be if the two groups of samples were drawn from the same population. The p-test, or null-hypothesis test, is the practice of using the p-value to either reject or retain the null hypothesis. For instance, if we choose a threshold of 0.05, or 5%, and the p-value falls below it, we claim that we can reject the null hypothesis. The word "reject" is in fact extremely misleading. Many people, including many researchers, take "reject" to mean that we have proven the null hypothesis false. That is absolutely wrong. As I mentioned before, proof by contradiction is mathematically sound: if I postulate that all swans are white, finding a black swan is enough to show I was wrong. But if I postulate that most swans are white, finding a black swan can't prove I was wrong. Similarly, finding a dataset that would be unlikely under the null hypothesis can't prove that the null hypothesis is false. Then, if it doesn't show the null hypothesis is false, can a low p-value at least show that the null hypothesis is likely to be false? Again, no. The null hypothesis is either true or false in this universe; making statements about the likelihood of the null hypothesis is like making statements about the likelihood of the existence of gravity, which makes no physical sense. Furthermore, conditional probability is not symmetric. The chance that I am having coffee given that I am at Starbucks is not the same as the chance that I am at Starbucks given that I am having coffee. I might drink a lot of coffee at home, yet always order coffee whenever I visit Starbucks. Then the chance of coffee given Starbucks is very high, while the chance of Starbucks given coffee is quite low, since most of the coffee I drink is at home.
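
In symbols, this asymmetry is just Bayes' rule (standard notation, my addition):

```latex
P(\text{coffee} \mid \text{Starbucks})
  \;=\; P(\text{Starbucks} \mid \text{coffee})\,
        \frac{P(\text{coffee})}{P(\text{Starbucks})}
```

The ratio on the right is generally not 1, so the two conditional probabilities can be wildly different. The same asymmetry is why the p-value, which conditions on the null hypothesis, tells us nothing by itself about the probability of the null hypothesis given the data.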

And last, even if we could prove the null hypothesis false, that still would not mean our original hypothesis is true. In the smarty drug example, even if we could prove that the two datasets are not drawn from the same population, which is the null hypothesis, we still could not conclude that the smarty drug caused the difference. All the data shows is a correlation between the smarty drug and test scores. Other factors could have contributed to the difference; for example, those who can afford the smarty drug may be generally wealthier and therefore have access to better education.

Even though the p-value test is riddled with logical problems, it is rather popular in academia because it makes it easy to get work published. If we choose a p-value threshold of 0.05, or 5%, which is the common threshold at most journals, then 5% of the time, even when the null hypothesis is true, we will get data that passes the threshold. So I just need to test enough hypotheses, regardless of whether they are true, and I can get roughly 1 in 20 published. The odds improve further if I try different summary statistics or massage the data, for instance by throwing away some outliers or trying different analysis strategies to see which one gives me the most favorable p-value. Furthermore, even if others can validate my calculation by following my steps, they cannot verify my p-value with any real-world measurement, since, as I noted above, the p-value cannot be measured. This practice is commonly referred to as p-hacking. P-hacking does not necessarily come from a researcher deliberately gaming the system; most of the time it comes from misunderstanding the mathematical meaning of the p-test, since many classes teach that a low p-value is evidence proving the hypothesis true. To make matters worse, we often cannot see the full scope of the p-hacking problem because of publication bias: we only see the papers published with low p-values, not the studies that showed no promise. Suppose I want to test the impact of different candies on test scores, I test 20 different candies, one of them comes out statistically significant, i.e., with a low p-value, and I publish a paper with that one result while ignoring all the others. To the public, it would appear that I tested one type of candy and got a good result, so it is easy for them to interpret that candy as actually having a positive effect on test scores. But if they knew about the 19 failed tests I had buried, they would be far less impressed, and would recognize that the one positive result was most likely just chance.
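
Here is a minimal sketch of why the buried tests matter: simulate 20 "candies" that do nothing at all, so both groups are drawn from the same distribution, and count how often at least one of them clears the 0.05 threshold anyway. The group sizes and score distribution are arbitrary assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 10_000   # how many times we repeat the whole 20-candy study
n_candies = 20
n_students = 30          # per group

false_positive_runs = 0
for _ in range(n_experiments):
    for _ in range(n_candies):
        # Both groups come from the same population: the candy does nothing.
        candy_group = rng.normal(loc=75, scale=10, size=n_students)
        control_group = rng.normal(loc=75, scale=10, size=n_students)
        _, p = stats.ttest_ind(candy_group, control_group)
        if p < 0.05:
            false_positive_runs += 1
            break  # one "significant" candy is enough to publish

print(false_positive_runs / n_experiments)  # close to 1 - 0.95**20, about 0.64
```

In other words, running 20 independent tests of completely ineffective candies yields at least one publishable "significant" result roughly two times out of three.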

So I have just painted a pretty bleak picture of the p-value. It is indeed very problematic in modern scientific journals. More and more scientists agree that there is a reproducibility crisis in science, and it is largely attributable to the misconception and misuse of p-value testing. But this is not an inherent problem with statistics. We can still use statistics in science; we just need to use it under the principles of the scientific method: observability, falsifiability, and reproducibility.

A lot of people think that science studies causality. But in fact, we cannot really observe causality. What we can observe is correlation: we see event A happen and then event B happen, and in a different experiment event A does not happen and B does not happen either. Instead of relying on something we cannot measure or observe, like the p-value, we should design experiments based on standardized, repeatable measurements. If three different researchers each have a different definition of efficacy, and they all measure the efficacy of the same drug, but one gets 90%, one gets 30%, and one gets 10%, we cannot meaningfully conclude much about the drug's efficacy. If we use statistics in a scientific experiment, we need to make sure that what we measure statistically is well defined and can be repeatably measured by different researchers. This also means that something measured only once cannot be called scientific, because we cannot know whether the measurement is consistent or repeatable.

Regarding falsifiability, researchers should design experiments to test predictions made from their hypotheses, not run analyses after data collection with the aim of rejecting the null. If I believe my smarty drug has 50% efficacy, I should first define how to measure efficacy: for instance, with two randomized groups of participants, one taking the drug and the other a placebo, who then take the same test, with efficacy defined from the ratio of the average score of the drug takers to the average score of the placebo takers. Then I test my prediction to see whether the measured efficacy is 50% or higher. We also need to make sure such a measurement of efficacy is consistent and repeatable by performing it multiple times with different groups. If the data contradicts my hypothesis, my hypothesis is wrong. If the data supports it, we have gained confidence in its validity. Remember, we can never prove our hypothesis to be true.
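
As a sketch of what such a pre-registered measurement might look like in code: the helper names, the example scores, and the reading of "50% efficacy" as a 50% relative improvement of the drug group's average over the placebo group's are all my own illustrative assumptions, not definitions from the episode:

```python
import numpy as np

def measured_efficacy(drug_scores, placebo_scores):
    """One plausible reading of the efficacy defined above: the relative
    improvement of the drug group's average score over the placebo group's."""
    return np.mean(drug_scores) / np.mean(placebo_scores) - 1.0

def prediction_holds(drug_scores, placebo_scores, predicted_efficacy=0.5):
    """Pre-registered prediction check: is the measured efficacy at least
    the predicted value (50% in the example)?"""
    return measured_efficacy(drug_scores, placebo_scores) >= predicted_efficacy

# Hypothetical scores from one replication; a real study would repeat this
# measurement with several independently randomized groups.
drug_scores = [82, 90, 77, 88, 95, 84]
placebo_scores = [60, 58, 65, 55, 62, 59]
print(measured_efficacy(drug_scores, placebo_scores))
print(prediction_holds(drug_scores, placebo_scores))
```

The point is not the particular formula but the order of operations: the measurement procedure and the prediction threshold are fixed before the data is collected, and the same procedure is rerun on new groups to check consistency.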

Finally, and most importantly, there is reproducibility. One experiment is not enough. The scientific motto is Nullius in Verba. It is not enough for me to get a positive result; if my hypothesis is correct, other people should get the same result as well. Reproducibility is arguably the most important principle of the scientific method. We cannot be confident in the validity of a scientific theory unless many people can reliably use it to make predictions and verify the results. Unfortunately, in the current research culture, reproducibility is often completely ignored. It is far easier to get attention and funding by creating a new theory that passed a p-value test once than by trying to verify someone else's hypothesis. I would love to say that people do science because they want to gain a better understanding of the universe, and many scientists do. But they also need to pay rent, keep the lights on, and maintain their research positions. As long as the culture values headline-worthy results over sound scientific methods, scientists who are pressured to constantly publish new papers will flood their fields with false results. Many argue that this is happening now, and I have to agree. It may look like great scientific progress when so many papers are published every year, but it is actually a disservice to both science and society when they do not follow the scientific method. If the general public is constantly bombarded with bad science and is not trained to tell the difference between science and pseudoscience, they will question, justifiably, all scientific results, even studies done with sound scientific methods, on topics such as climate change, vaccines that have existed for decades, or the shape of the Earth.

This is a big part of why I want to talk about the scientific method: it is not some obscure subject that only scientists should care about. It affects everyone in society. When a politician uses scientific research to guide policy, how can we know whether that research was well executed, or is bad science serving some agenda? I believe every one of us has an epistemological responsibility to do our best to prevent the spread of inaccurate information, because bad information often leads to bad decisions. But to tell science apart from pseudoscience, we cannot just look at superficial things such as author names or publication journals; we must learn to use logic and the principles of the scientific method to analyze a study. What does the study propose to measure? Have others reproduced its results? Does the measurement deductively support the conclusion? Even then, we will still make mistakes from time to time, so we should keep an open mind toward different opinions, be willing to accept new evidence and criticism from others, and, above all, never stop learning.

Doing science is not just about forming theories based on what we see. It requires well-defined terminology, consistent measurement standards, and rigorous logic to build models that do not merely interpret the past but predict the future, so that we can design experiments to verify our predictions. Furthermore, science requires strong communication skills. It is not enough for us to get the results we want; we also need to communicate with others so that they can use our terminology, measurement standards, and models to make their own predictions. Despite the motto Nullius in Verba, science is inherently a form of human collaboration. We need both those who propose theories and those who reproduce results to do sound science, and we should recognize the work of reproducing results as just as valuable as publishing new theories. Because without reproducibility, those theories are pseudoscience at best.
