𝗕𝘂𝗶𝗹𝗱𝗶𝗻𝗴 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗩𝗲𝗿𝘀𝗶𝗼𝗻 𝗖𝗼𝗻𝘁𝗿𝗼𝗹 𝗪𝗶𝘁𝗵𝗼𝘂𝘁 𝗗𝘂𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗻𝗴 𝗙𝗶𝗹𝗲𝘀
Storing full copies of files for every version or fork wastes space. If you change one line in a project with ten files, you should not save all ten files again.
I faced this problem while building my LaTeX Writer project. I needed a way to handle version control and project forking without high storage costs.
I looked at how GitHub works. GitHub does not store a full repository every time you make a change. It stores content separately and uses references to link files and commits.
I built my system using three main components:
- Metadata: This stores IDs for projects, owners, and folders.
- File Records: These store file names and links to content.
- Blobs: This is where the actual content lives.
The system works through content hashing. When you save a file, the system generates a unique ID based on the content. If the content already exists, the system reuses the existing Blob. It does not create a new one.
This approach makes forking easy and cheap. When you fork a project:
- The system creates a new Project ID.
- It creates new metadata for files and folders.
- It points the new metadata to the existing Blobs.
No actual file content is copied during a fork. You only duplicate the small metadata records.
When you edit a fork, the process stays efficient:
- You change the content.
- The system hashes the new content.
- It creates a new Blob only if that exact content does not exist.
- The metadata for your fork points to the new Blob.
- The original project still points to the old Blob.
This method provides several benefits:
- Content deduplication saves massive amounts of space.
- Forking happens instantly.
- Version management stays organized.
- Database growth stays slow.
You get GitHub-like functionality without the heavy storage overhead.