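library(tidyverse) # read_csv, dplyr, tidyr, stringr, forcats, ggplot2
library(ggdist) # geom_pointinterval (assumed to come from ggdist)
df <- read_csv("../data/nootroflix_ratings_users_clean.csv")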
In this part, I present the methodology I’ll use throughout this post.
First, Nootroflix changed over time in ways that we have to take into account:
In the first version of Nootroflix (for which there are the most ratings), the default answer for “issues with this nootropic” was “None / Unsure”, and you could only select one issue.
In the subsequent versions, multi-selection was possible, and the default answer was no selection at all (but you could still select “None/Unsure”).
For the first version, we restrict our analysis to users with enough ratings (more than 15) who entered an issue for at least one nootropic (because a lot of people never entered issues). The resulting estimate is probably still an underestimate, because some people might have entered one issue but not every issue they’ve had.
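As a minimal sketch of this restriction, here is the same filter applied to made-up data. The `toy_old` table and its three users are hypothetical, and the sketch uses `any()` per user rather than the two-step filter used later in this post, which should select the same users.

```r
library(dplyr)

# Hypothetical toy data:
#   user "a": 20 ratings, one reported issue       -> kept
#   user "b": 20 ratings, never changed the default -> dropped
#   user "c": 5 ratings, one reported issue         -> dropped (too few ratings)
toy_old <- tibble(
  userID = c(rep("a", 20), rep("b", 20), rep("c", 5)),
  itemID = c(paste0("item", 1:20), paste0("item", 1:20), paste0("item", 1:5)),
  issue  = c("tolerance", rep("none", 19),
             rep("none", 20),
             "addiction", rep("none", 4))
)

kept_users <- toy_old %>%
  group_by(userID) %>%
  filter(n() > 15, any(issue != "none")) %>% # enough ratings, at least one issue
  ungroup() %>%
  distinct(userID) %>%
  pull(userID)

kept_users
```

Only user "a" passes both conditions here.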
For the second version, we can compute upper and lower bounds for the probability of an issue, as well as an unbiased (but very noisy) estimate:
lower bound: number of people who reported this issue for this nootropic / number of people who tried this nootropic. To tighten the bound, we restrict our analysis to people who rated enough nootropics, and reported at least one issue.
upper bound: number of people who reported this issue for this nootropic / number of people who answered the issue question for this nootropic (no empty selection). This is probably an upper bound because you’re more likely to answer the “issues” question if you had an issue. To tighten the bound, we can restrict ourselves to users who have entered “None/Unsure” as an issue for at least one other nootropic.
unbiased estimate: Some users (169, with 1830 ratings) were kind enough to answer the issue question for every rating they made. If we restrict ourselves to these users, we should get an unbiased estimate, assuming these users aren’t too unusual.
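To make the three quantities concrete, here is a minimal sketch on made-up data for a single nootropic. The `toy_new` table and the `answers_everything` flag are hypothetical (in the real data that flag has to be derived from each user's other ratings), and the extra user restrictions used to tighten the bounds are left out.

```r
library(dplyr)

# Hypothetical ratings of one nootropic by ten users.
# "unknown" means the issue question was left empty.
toy_new <- tibble(
  userID = paste0("u", 1:10),
  issue  = c("side_effects", "side_effects", "none", "none", "none",
             "unknown", "unknown", "unknown", "unknown", "unknown"),
  # TRUE if the user answered the issue question for all of their ratings
  answers_everything = c(TRUE, FALSE, TRUE, TRUE, TRUE,
                         FALSE, FALSE, FALSE, FALSE, FALSE)
)

bounds <- toy_new %>%
  summarise(
    # reported the issue / tried the nootropic
    lower = sum(issue == "side_effects") / n(),
    # reported the issue / answered the issue question
    upper = sum(issue == "side_effects") / sum(issue != "unknown"),
    # same ratio, restricted to users who always answer
    unbiased = sum(issue == "side_effects" & answers_everything) /
      sum(answers_everything)
  )

bounds
```

With these numbers the lower bound is 2/10, the upper bound is 2/5, and the estimate from users who always answer is 1/4, which falls between the two.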
Let’s see what this gives us when estimating the probability of having to stop because of side effects:
# Second version: issues are stored as a bracketed, comma-separated list
df_new <- df %>%
  filter(str_detect(issue, "\\[")) %>%
  mutate(issue = str_remove(issue, "\\["),
         issue = str_remove(issue, "\\]"),
         issue = str_remove_all(issue, "\\'"),
         issue = str_split(issue, ",")) %>%
  unnest(issue) %>% # one row per (rating, issue)
  mutate(issue = case_when(
    issue == "" ~ "unknown", # empty selection
    str_detect(issue, "None") ~ "none",
    str_detect(issue, "Developed addiction") ~ "addiction",
    str_detect(issue, "Developed tolerance") ~ "tolerance",
    str_detect(issue, "Other issues") ~ "other",
    str_detect(issue, "Had to stop because of side effects") ~ "side_effects",
    str_detect(issue, "Persistent side effects") ~ "long_term_side_effects"))
# First version: a single issue per rating, no brackets
df_old <- df %>%
  filter(!str_detect(issue, "\\[")) %>%
  mutate(issue = case_when(
    issue == "" ~ "unknown",
    str_detect(issue, "None") ~ "none",
    str_detect(issue, "Developed addiction") ~ "addiction",
    str_detect(issue, "Developed tolerance") ~ "tolerance",
    str_detect(issue, "Other issues") ~ "other",
    str_detect(issue, "Had to stop because of side effects") ~ "side_effects",
    str_detect(issue, "Persistent side effects") ~ "long_term_side_effects"))
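To illustrate the parsing step for second-version ratings, here is a small sketch on made-up strings. The exact raw format (a bracketed, single-quoted, comma-separated list) is an assumption inferred from the cleaning code, and `toy_raw` is hypothetical. The sketch also trims whitespace after splitting, which the code in this post does not need because it matches with `str_detect`.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical raw values: one multi-selection and one empty selection
toy_raw <- tibble(
  userID = c("u1", "u2"),
  issue  = c("['Developed tolerance', 'Other issues']", "[]")
)

toy_parsed <- toy_raw %>%
  mutate(issue = str_remove_all(issue, "[\\[\\]']"), # drop brackets and quotes
         issue = str_split(issue, ",")) %>%          # list-column of issues
  unnest(issue) %>%                                  # one row per issue
  mutate(issue = str_trim(issue))

toy_parsed
```

The multi-selection becomes two rows, and the empty selection becomes a single row with an empty string, which is what the `issue == ""` branch then maps to "unknown".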
Using data from the first version of Nootroflix:
users_filling_issues <- df_old %>%
  group_by(userID) %>%
  filter(n() > 15) %>% # enough ratings
  ungroup() %>%
  filter(issue != "none") %>% # at least one not None
  distinct(userID) %>%
  pull(userID)
estimate_old <- df_old %>%
  filter(userID %in% users_filling_issues) %>%
  group_by(itemID, issue) %>%
  summarise(count = n(), .groups = "drop_last") %>% # stays grouped by itemID
  mutate(count_total = sum(count)) %>% # total per nootropic
  mutate(variant = "first version")
Using data from the second version of Nootroflix:
estimate_new_lower <- df_new %>%
  filter(userID %in% (df_new %>%
                        group_by(userID) %>%
                        filter(n() > 15) %>% # enough ratings
                        ungroup() %>%
                        filter(issue != "unknown") %>% # at least one issue entered
                        distinct(userID) %>%
                        pull(userID))) %>%
  group_by(itemID, issue) %>% # assume only one rating set per user
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(count_total = sum(count)) %>%
  mutate(variant = "second version, lower")
estimate_new_upper <- df_new %>%
  filter(userID %in% (df_new %>%
                        group_by(userID) %>%
                        filter(n() > 15) %>% # enough ratings
                        ungroup() %>%
                        filter(issue == "none") %>% # has entered "none" at least once
                        distinct(userID) %>%
                        pull(userID))) %>%
  group_by(itemID, issue) %>% # assume only one rating set per user
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(count_total = sum(count)) %>%
  pivot_wider(names_from = issue, values_from = count) %>%
  mutate(across(where(is.numeric), ~replace_na(.x, 0))) %>%
  mutate(count_total = count_total - unknown) %>% # drop empty selections from the denominator
  pivot_longer(cols = c("none", "other", "side_effects", "addiction", "long_term_side_effects", "tolerance"), values_to = "count", names_to = "issue") %>%
  mutate(variant = "second version, upper")
estimate_new_unbiased <- df_new %>%
  group_by(userID) %>%
  mutate(n_issues = sum(issue != "unknown")) %>%
  filter(n_issues == n()) %>% # we restrict ourselves to users who answered the issue question for every rating
  ungroup() %>%
  group_by(itemID, issue) %>% # assume only one rating set per user
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(count_total = sum(count)) %>%
  mutate(variant = "second version, unbiased")
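All of the estimates above rely on the same counting pattern: after `group_by(itemID, issue)`, `summarise()` drops the last grouping level by default, so the following `sum(count)` is taken per nootropic. A minimal sketch on made-up data (the `toy_counts` table is hypothetical):

```r
library(dplyr)

# Hypothetical ratings: three for nootropic "A", two for nootropic "B"
toy_counts <- tibble(
  itemID = c("A", "A", "A", "B", "B"),
  issue  = c("none", "none", "addiction", "none", "tolerance")
) %>%
  group_by(itemID, issue) %>%
  summarise(count = n(), .groups = "drop_last") %>% # still grouped by itemID
  mutate(count_total = sum(count)) %>%              # total per nootropic
  ungroup()

toy_counts
```

Each row then carries both the count for one issue and the total for its nootropic, so `count / count_total` is the proportion of interest.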
The four estimates (three from the second version, one from the first) seem to give consistent results:
estimate_new_upper %>%
  bind_rows(estimate_new_lower) %>%
  bind_rows(estimate_new_unbiased) %>%
  bind_rows(estimate_old) %>%
  mutate(nootropic = itemID) %>%
  mutate(across(where(is.numeric), ~replace_na(.x, 0))) %>% # only numeric columns can take 0
  filter(count_total > 0) %>%
  rowwise() %>%
  mutate(prop = prop.test(count, count_total, conf.level = 0.95)$estimate,
         prop_low = prop.test(count, count_total, conf.level = 0.95)$conf.int[[1]],
         prop_high = prop.test(count, count_total, conf.level = 0.95)$conf.int[[2]]) %>%
  ungroup() %>%
  filter(issue == "side_effects") %>%
  mutate(nootropic = str_sub(nootropic, 1, 25)) %>%
  mutate(nootropic = fct_reorder(nootropic, prop)) %>%
  filter(prop_high - prop_low < 0.6) %>% # drop very wide intervals
  group_by(nootropic) %>%
  filter(n() == 4) %>% # keep nootropics that have all four estimates
  ungroup() %>%
  #filter(nootropic %in% sample(levels(nootropic), 5)) %>%
  #group_by(nootropic) %>%
  #mutate(prop_mean = mean(prop)) %>%
  #ungroup() %>%
  #filter(rank(-prop) < 15) %>%
  ggplot() +
  geom_pointinterval(aes(x = prop, xmin = prop_low, xmax = prop_high, y = nootropic, color = variant), position = position_dodge2()) +
  scale_x_log10() +
  xlab("Probability of side effects") +
  ylab("")
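The point estimates and intervals in this plot come from `prop.test`. As a small sketch of what a single call returns, here it is on made-up counts (3 reports out of 40 users, both hypothetical). For a single proportion, `prop.test` builds the interval by inverting the score test (a Wilson-type interval), with a continuity correction by default.

```r
# Hypothetical counts: 3 users reported the issue out of 40 who tried the nootropic
res <- prop.test(x = 3, n = 40, conf.level = 0.95)

est <- unname(res$estimate) # observed proportion, 3 / 40
ci  <- res$conf.int         # lower and upper ends of the 95% interval

est
ci
```

The interval is not symmetric around the estimate, which is expected for a proportion this close to zero.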