𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗶𝗻𝗴 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗦𝗰𝗿𝗲𝗲𝗻𝗶𝗻𝗴

📅1 week ago⏱1 min read

Manual paper screening takes too long. Independent scientists waste weeks on this. You need a faster way.

The goal is high recall: you do not want to miss relevant papers. Train a model to label each paper as include or exclude. The papers it labels exclude form a discard pile that you can set aside without reading each one in full.

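Use a scikit-learn pipeline that combines a TF-IDF vectorizer with Logistic Regression. Recall is not a setting on the model itself. Instead, choose the probability cutoff for "include" so that recall on labeled papers held out from training is at least 0.95, meaning the model keeps at least 95% of the relevant papers.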
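
Here is a minimal sketch of that setup. The toy titles, the helper names build_screener and threshold_for_recall, and the class_weight="balanced" setting are assumptions for illustration, not part of the original recipe. For brevity the cutoff is picked on the training papers; in practice, pick it on labeled papers the model has not seen.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_screener():
    """TF-IDF + Logistic Regression pipeline with the settings from this post."""
    return make_pipeline(
        TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
        # Assumption: relevant papers are rare, so reweight the classes.
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


def threshold_for_recall(y_true, scores, target_recall=0.95):
    """Highest cutoff such that including every paper with score >= cutoff
    keeps at least target_recall of the relevant papers (label 1)."""
    relevant = np.sort(np.asarray(scores)[np.asarray(y_true) == 1])[::-1]
    if relevant.size == 0:
        raise ValueError("need at least one relevant paper")
    needed = int(np.ceil(target_recall * relevant.size))
    return float(relevant[needed - 1])


# Toy data, invented for illustration only (1 = include, 0 = exclude).
texts = [
    "randomized trial of drug therapy for hypertension",
    "drug therapy lowers blood pressure in adults",
    "blood pressure outcomes after drug treatment trial",
    "survey of deep learning for image classification",
    "history of medieval trade routes in europe",
    "image segmentation with convolutional networks",
]
labels = [1, 1, 1, 0, 0, 0]

screener = build_screener()
screener.fit(texts, labels)
scores = screener.predict_proba(texts)[:, 1]  # probability of "include"

cutoff = threshold_for_recall(labels, scores, target_recall=0.95)
include = scores >= cutoff  # everything below the cutoff is the discard pile
```

Lowering the cutoff raises recall but shrinks the discard pile, so the 0.95 target is a trade-off between missed papers and reading time.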

Follow these steps:

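1. Make a dataset. Use a spreadsheet with one row per paper, and record the title, abstract, and label (include or exclude).
2. Train the model. Use scikit-learn. Limit the TF-IDF vocabulary to 5,000 features (max_features=5000) and use unigrams and bigrams (ngram_range=(1, 2)).
3. Check the results. Draw a random sample from the exclude pile and review it by hand. If relevant papers turn up, lower the cutoff or add more labeled examples and retrain.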
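
The dataset and checking steps can be sketched with the standard library alone. The column names, the helper names, the paper IDs, and the audit outcome below are all hypothetical, chosen only to show the mechanics.

```python
import csv
import io
import random

# Stand-in for a spreadsheet exported as CSV; the column names are assumptions.
CSV_TEXT = """title,abstract,label
Drug trial A,Randomized trial of a blood pressure drug,include
Trade routes,History of medieval trade in Europe,exclude
Image nets,Convolutional networks for segmentation,exclude
Drug trial B,Follow-up outcomes of drug therapy,include
"""


def load_papers(csv_file):
    """Read rows and combine title and abstract into one text field."""
    papers = []
    for row in csv.DictReader(csv_file):
        papers.append({
            "text": f"{row['title']}. {row['abstract']}",
            "label": 1 if row["label"].strip().lower() == "include" else 0,
        })
    return papers


def sample_for_audit(excluded, k, seed=0):
    """Draw up to k papers at random from the exclude pile for manual review."""
    rng = random.Random(seed)
    return rng.sample(excluded, min(k, len(excluded)))


def observed_miss_rate(reviewed_labels):
    """Share of audited papers that a human judged relevant (1) after all."""
    if not reviewed_labels:
        raise ValueError("audit sample is empty")
    return sum(reviewed_labels) / len(reviewed_labels)


papers = load_papers(io.StringIO(CSV_TEXT))

# Hypothetical exclude pile: IDs of papers the model scored below the cutoff.
exclude_pile = [f"paper-{i}" for i in range(200)]
audit = sample_for_audit(exclude_pile, k=20, seed=42)

# Made-up audit outcome: 1 of the 20 sampled papers turned out to be relevant.
miss_rate = observed_miss_rate([1] + [0] * 19)
print(len(papers), len(audit), miss_rate)  # prints: 4 20 0.05
```

A small audit sample gives only a rough estimate, so a clean sample does not prove the exclude pile is free of relevant papers.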

This process saves time. You spend more time on analysis and less on sorting.
