How to organise your data science projects in GitHub
A clear blueprint for structuring data-science research repositories, plus the GitHub workflow I use to collaborate with researchers.
Data science has always been a passion of mine. The use of data at the intersection of mathematics, statistics, computer science, and specific domains (such as biology or finance) is fascinating. The transdisciplinary nature of the field also makes projects genuinely hard to manage. It took me a long time to find a way through that, and once collaborators joined, the complexity grew with every person added. What saved me was borrowing concepts from software engineering. Once I applied them, managing my data-science projects started to flow naturally.
What you’ll take away from this post
- A clear blueprint for enhancing productivity and improving project management in research.
- The main GitHub strategies that helped me manage transdisciplinary data-science projects.
I learned everything below from building web applications with a small team of friends to help entomologists during COVID-19. It was a fantastic opportunity to coordinate with university stakeholders, students, professors and collaborators, and to ship a long-lasting tool for the Department of Entomology and Acarology at the University of São Paulo.
The four questions that forced me to design a system
After several meetings with my collaborators, I kept hitting the same practical problems. They started as questions I could not answer well:
- How can I easily find the functionalities I created in my own projects?
- How can I optimise naming and directory layout so collaborators can understand what is going on?
- How can I share progress and track changes in our work?
- How can I stay safe and travel back to an earlier version of the project when something breaks?
A few years in, my team and I had a system to manage our web-development projects, and we still use it today. After realising it adapts beautifully to classical data-science research, I started using the same approach in every research project. Here it is.
The ultimate data-science project structure (for me)
It is the ultimate for me — I am sure you can design something better. I call it that because, given the current demands of my work, I do not feel any need to change it. If you have suggestions, please let me know. Here is the skeleton I use for data-science research projects:
├── 01_data_preprocessing
│ └── 011_data_preprocessing.r
├── 02_machine_vision_features
│ ├── 021_feature_extraction.py
│ └── 022_data_augmentation.py
├── 03_learning_algorithms
│ ├── 031_CNN_performance.py
│ └── 032_Machine_learning_algorithms_performance.py
├── 04_data_visualisation
├── Plots
├── input_data
│ └── images
├── output_data
│ └── results.csv
├── 00_source.r
The core idea is to align directories with the main actions in your project. In the example above, the goal was to identify species using classical machine learning and deep-learning methods. The first step was to prepare the dataset, so all the related functions live in 01_data_preprocessing. Then we extracted features with machine-vision techniques in 02_machine_vision_features. The learning algorithms used to classify the images live in 03_learning_algorithms. Finally, the data-visualisation code lives in 04_data_visualisation.
This structure has a wonderful side-effect for science: it mirrors the methodology section of a paper. A reader can match a figure or a result back to the script that produced it without going on a directory expedition.
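To make the "methodology section" idea concrete, here is a minimal Python sketch that walks a project and lists the numbered stage directories with their numbered scripts, in order. The function names pipeline_outline and format_outline are hypothetical helpers invented for this illustration, and the sketch assumes every stage directory and script follows the zero-padded numeric prefix convention shown in the skeleton.

```python
from pathlib import Path
import re

# Matches names that start with a numeric prefix followed by an underscore,
# e.g. "01_data_preprocessing" or "021_feature_extraction.py".
# Assumption: the whole project follows this naming convention.
NUMBERED = re.compile(r"^\d+_")


def pipeline_outline(project_root):
    """Return (stage_directory, [script names]) pairs sorted by name.

    Sorting by name gives the intended order only because the prefixes
    are zero-padded (01, 02, ...).
    """
    root = Path(project_root)
    stages = sorted(
        (p for p in root.iterdir() if p.is_dir() and NUMBERED.match(p.name)),
        key=lambda p: p.name,
    )
    outline = []
    for stage in stages:
        scripts = sorted(
            f.name for f in stage.iterdir() if f.is_file() and NUMBERED.match(f.name)
        )
        outline.append((stage.name, scripts))
    return outline


def format_outline(outline):
    """Render the outline as indented plain text, one stage per block."""
    lines = []
    for stage, scripts in outline:
        lines.append(stage)
        lines.extend(f"  {name}" for name in scripts)
    return "\n".join(lines)
```

Calling print(format_outline(pipeline_outline("."))) from the project root should print something close to a table of contents for the methods section. Directories without a numeric prefix, such as Plots or input_data, are skipped on purpose.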
The remaining directories — input_data, output_data, and Plots — do exactly what their names suggest: raw inputs, processed outputs from your analysis, and the visualisations you generated. A single 00_source.r (or 00_source.py) acts as the entry point that wires everything together. The whole layout works equally well in R and Python projects.
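If you want to start a new project from this layout, a small script can create the empty skeleton for you. The following is only a sketch under a few assumptions: the directory names are copied from the example above, scaffold_project is a hypothetical helper name, and the placeholder comment written into the entry point is illustrative rather than part of my actual setup.

```python
from pathlib import Path

# Directory names taken from the skeleton in this post; rename them to
# match the main actions of your own project.
STAGE_DIRS = [
    "01_data_preprocessing",
    "02_machine_vision_features",
    "03_learning_algorithms",
    "04_data_visualisation",
]
SUPPORT_DIRS = ["Plots", "input_data/images", "output_data"]


def scaffold_project(root, entry_point="00_source.py"):
    """Create the skeleton under `root` without overwriting existing files."""
    root = Path(root)
    for name in STAGE_DIRS + SUPPORT_DIRS:
        # parents=True creates `root` and nested folders such as input_data/images;
        # exist_ok=True makes the function safe to run more than once.
        (root / name).mkdir(parents=True, exist_ok=True)
    entry = root / entry_point
    if not entry.exists():
        # Placeholder content only; the real entry point is project-specific.
        entry.write_text("# Entry point: run the numbered stages in order.\n")
    return root
```

Running scaffold_project("my_project") creates the folders and an empty entry point, and running it again leaves existing files untouched. For an R project, passing entry_point="00_source.r" gives the same layout with an R entry file.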
GitHub — the ultimate tool for data-science collaboration
I love GitHub. I learned most of what I know by working with Laravel and studying with that community of web developers (if you want to go deeper, Laracasts is excellent). My first GitHub repository came when I was an undergrad in Biology — nobody taught me, I just got curious and tried to solve problems with friends.
After creating a GitHub account and connecting it to your machine using the official "Set up Git" guide, you can create a new repository and link your local project to it. The "Adding locally hosted code to GitHub" docs walk through this exact path. Once your repository is online, the real fun begins: GitHub Projects.
GitHub is not just version control. For a research team, it is the most underrated piece of project-management software in academia.
The Kanban workflow I use with collaborators
You can spin up a new Project from inside the repository. GitHub's planning and tracking docs are the most thorough reference. I almost always pick the Kanban view and use five columns: Backlog, Ready, In Progress, In Review, and Done. From there it is just issues and milestones.
Backlog holds everything we collected in our first brainstorming session — the wishlist of actions for the project. Ready is what we have agreed to actually start. In Progress is whatever is open right now. In Review is where the collaboration multiplies: another team member checks the work before it moves on. Done is the column I love most — getting things done.
Closing thoughts
I hope this gave you an alternative way to organise your data-science projects and that it helps you answer the four questions I asked at the beginning. In the next part of this series I’ll go deeper into the design of issues and milestones — how to slice them so that they actually pull work forward instead of becoming a to-do museum. If you have questions in the meantime, please send them my way.
All the best,
Gabriel
