-
Data Science Project Template Guide
This guide provides a detailed explanation of how to use the Data Science Project Template available at khuyentran1401/data-science-template. The template incorporates best practices for structuring a data science project, making it maintainable and reproducible.
Getting Started
To get started with the template, you first need to install
cookiecutter
:Then, you can download the template using the following command:
You will be prompted to enter some details about your project, and a project with the specified name will be created in your current directory.
Project Structure
The project structure is organized into several directories, each serving a specific purpose:
-
src (Source code): This folder contains the main Python scripts for the project. It typically includes code related to data gathering, data preparation, feature extraction, model training, and evaluation. Each of these tasks could have a separate Python script or module. This organization allows for code reusability and better understanding of the project’s workflow.
-
tests: This directory contains unit tests for the code in the
src
directory. These tests help ensure that the code is functioning as expected and can catch potential bugs or errors. This is an essential part of any software development process, including data science projects. -
models: This directory is used to store trained machine learning models. The model names could be set as
projectname_date_time
orproject_build_id
to keep track of different versions of the models. This is especially useful when you are experimenting with different models or model parameters. -
data: This directory contains all the data used in the project. It is typically divided into
raw
,interim
,processed
, andexternal
subdirectories:raw
: Stores the original, immutable data dump.interim
: This is where intermediate data that has been transformed goes.processed
: This directory holds the final, canonical data sets for modeling.external
: Stores external data, such as data from third-party sources.
-
notebooks: The
notebooks
directory is used for Jupyter notebooks. These notebooks allow for interactive data analysis with Python and can include descriptions, live code, equations, and visualizations. They are often used for exploratory data analysis, data cleaning and transformation, statistical modeling, data visualization, machine learning, and reporting results. -
reports: This directory is used for storing outputs of the project like figures, tables, and other reporting artifacts. It helps to keep all generated reports in one place for easy access and review.
Code Quality Control with Pre-commit
The template uses
pre-commit
to automate the process of checking Python code before committing. This includes checking for a consistent style guide, including docstrings, and more. The hooks executed bypre-commit
are specified in the.pre-commit-config.yaml
file.To install the pre-commit hooks, run:
Now, whenever you run
git commit
, your code will be automatically checked and reformatted before being committed.API Documentation with Pdoc
The template uses
pdoc
to automatically generate API documentation based on the docstrings of your Python files and objects. To generate the documentation, run:The documentation is saved as Markdown files and can be accessed by opening
http://localhost:8080
in your web browser.Conclusion
With this guide, you now have a good understanding of how to use the Data Science Project Template to structure your data science projects. The template is designed to be flexible, so feel free to adjust it based on your applications. Happy coding!
References
-