Using multiple Git repositories instead of a single one containing many apps from different teams?
I am migrating a big, ten-year-old CVS repository to Git. It seemed obvious to split this multi-project repository into several Git ones, but the decision-makers are used to CVS, so their point of view is influenced by the CVS philosophy.
To convince them to migrate from one CVS repo to different Git repositories I need to give them some arguments.
When I speak with colleagues who have worked with Git for years, they say that using multiple Git repositories is the way to use Git. I do not really know why (they have given me some ideas). I am a newbie in this field, so I am asking my question here.
What are the arguments to use multiple Git repositories instead of a single one containing different applications and libraries from different teams?
I have already listed:
- branches/tags apply to all files in the Git repository => they pollute other teams' projects
- a supposed 4 GB limit on Git repository size, but this turned out to be wrong
- `git annotate` may be slower on a bloated Git repo...
- Eamon Nerbonne pointed out the related question:
Choosing between Single or multiple projects in a git repository?
- The reason the team managers finally accepted the split: the single Git repo (550 MB) took 13 minutes to clone on Windows (one minute on Linux).
- The bloated CVS repo was split into 100 Git repositories:
- each dead app in its own repo
- each stabilized library in its own repo (source code that hardly ever changes any more)
- related apps/libs kept together in one repo
- large files not used for compilation (config...) moved to other repos (Git does not like large files)
- other irrelevant files skipped (...)
- Successfully installed the `repo` tool used by the Android Open Source Project in order to handle all these Git repos:
- easy installation on Linux
- more difficult on Windows because of the Cygwin and NTFS native symlink requirements
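The first argument above (branches/tags polluting other teams' projects) can be demonstrated directly; all repository, app, and tag names below are made up:

```shell
# Tags and branches in Git always span the whole repository: tagging a
# release of app-a also pins a point in app-b's history.
git init -q monorepo
mkdir -p monorepo/app-a monorepo/app-b
echo "a" > monorepo/app-a/main.c
echo "b" > monorepo/app-b/main.c
git -C monorepo add .
git -C monorepo -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial import"
git -C monorepo tag app-a-v1.0       # there is no way to scope this to app-a
git -C monorepo tag --contains HEAD  # app-a-v1.0 -- it covers app-b as well
```

With separate repositories, `app-a-v1.0` would exist only in app-a's own history.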
@JimMartens: **What are the reasons Git users usually use several Git repositories while CVS users use a single one?** (one Git repo per deliverable module?) Thanks for your comment, I may change the question explanation... Finally I changed the title ;-)
Do they? We used to have different SVN repositories for different projects.
@LeonardoHerrera: In the unique CVS repo, there are main directories for each *group of applications*, then sub-directories for each team project, each team developing/maintaining different libs and apps. Actually, there are also historic-reason-garbage directories and common directories... Some *groups of applications* have already moved from CVS to SVN (one SVN repo per group). I am migrating one of the other *groups of applications*, which was planning to migrate to a single Git repository.
@LeonardoHerrera It's not just a CVS/SVN difference, either - we use one massive SVN repo with thousands of directories at the root level, including a lot of dead projects. It's just an IT/business decision what layout to use.
@olibre: I'd avoid multiple repos unless you actually need it. When you do, it's not hard to split repos.
Thanks for the link :) To manage multiple Git repos I successfully installed the `repo` tool used by the Android Open Source Project on Linux (easy) and Windows (more difficult because of the Cygwin and NTFS native symlink requirements). But you are right, to make developers' lives pleasant I have kept some related projects together. Cheers ;)
You're dealing with multiple teams and multiple projects. Likely decades of work went into the codebase.
The short answer is that your teams and projects have varying needs and varying dependencies.
The monolithic repository approach reduces commits to "Everything is stable in this configuration!!!" (i.e. unrealistic, huge commits sourced from many teams). That, or many intermediate points of incompatibilities for many projects. Either way, there's a lot of wasted energy invested in supporting configurations which were simply never meant to be.
Your projects should instead be structured independently, each drawing on multiple repositories that represent its dependencies. The dependencies should be configured, updated, and tested by the project maintainers at appropriate points in development.
- ProjectA saw its last major release 3 years ago. It is in maintenance mode and has "older" system requirements. It should refer to an appropriate set of dependencies. It has 20 dependencies.
- ProjectB was just released. It has more modern System Requirements, and was developed and tested by another team. It has 15 dependent libraries (=repos), 10 of which are shared with ProjectA. These projects generally refer to different commits of their dependent libraries. Dependencies are updated at appropriate points in development.
- ProjectC is yet to be released. It's very similar to ProjectB, but includes significant changes and improvements to its dependencies. Developers of ProjectB are only interested in taking the stable releases of the dependencies they share with ProjectC. ProjectB's team makes some commits to the shared dependencies, although they are mostly bugfixes and optimizations at this time. A monolithic repository would either hold development of ProjectC back in order to maintain support for ProjectA, or ProjectC's changes would break A and B, or developers would just end up not sharing/reusing code.
With multiple (distributed) repositories, each team can work independently, minimizing the impact on other projects while reusing and constantly improving the codebases. This also keeps teams from shifting focus or pace when changes come in from other teams. A centralized monolithic repository makes each team dependent on every other team's moves, all of which would have to be synchronized.
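One way to sketch this per-project dependency pinning is git submodules, where each project records the exact commit of every shared library it builds against (all repo, file, and library names below are hypothetical):

```shell
# A shared library lives in its own repository...
git init -q libfoo
echo "int foo(void);" > libfoo/foo.h
git -C libfoo add .
git -C libfoo -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "libfoo: first stable cut"

# ...and each project pins the precise libfoo commit it depends on.
# (protocol.file.allow=always is only needed because this demo uses a
# local path instead of a remote URL.)
git init -q projectB
git -C projectB -c protocol.file.allow=always \
    submodule --quiet add "$(pwd)/libfoo" libs/libfoo
git -C projectB submodule status     # shows the pinned commit per dependency
```

ProjectA would hold its own, possibly older, pin of the same library, so each team upgrades when it chooses to.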
I am reviewing our number of repositories. A compelling advantage of monorepo is that truly global files like system configuration are versioned along with all other projects. Thus a monorepo would contain the old configuration to support Project A.
What you describe here is exactly the opposite of what Google does. (That paper describes why they feel that, even with two billion lines of code, the monolithic approach works far better.)
@CurtJ.Sampson Well… Google has built an in-house revision control system, several supporting utilities, build systems, has a compiler team, etc. all of which support their workflow. My shop doesn't have those tools or teams. Unless you work for Google, you don't have those tools either. Google's Android codebase uses Git; they use over 800 Git repositories. This question is specific to Git, right? FWIW I typically structure my repos by A) a large-ish repo with several shared libraries B) various Application/Project repos which depend on A and C) separate repos with various media assets.
I brought up Google as an example of a company that clearly has a very high level of technical skill and found having a single repo so compelling that they were willing to spend literally millions of dollars to create the unique system they needed to do that. If someone doesn't understand at all why they would put in that kind of effort, I hope it's obvious that they should be thinking not "Google's wrong" but "there's something I'm missing here." But as I think about this more, I think I need to add an answer to argue my opinion at appropriate length.
There doesn't seem to be an argument in favor of the big repo in this thread, so here's one:
The advantage of a big repo with all your code in it, is that you have a reliable source of truth. All the state in your overarching project is represented in that repo's history. You don't have to worry about questions like "What version of libA do I need to build libB from 3 months ago?" or "Did the integration tests start failing because of Susan's change in libC or Bob's change in libD?" or "Are there any callers left for evilMethod()?" It's all in the history.
When related projects are split into separate repos, git isn't keeping track of their relationships for you. Your build system needs to know where to find the code for all its dependencies, and more importantly what version of the code to build. You can "just build everything from master", but that makes it hard to reproduce past builds, hard to make changes (or rollbacks) that need to be synced across repos, and hard to keep branches in a stable state.
So the question isn't "One big repo or many small repos?" It's actually "One big repo or many small repos with tooling." What tool are you going to use? Repo (Android) and gclient (Chromium) from Google are two examples. Git submodules is another. All of those have major downsides that you have to weigh against the downsides of a big repo.
Edit: Here are some more answers: Choosing between Single or multiple projects in a git repository?
PS: All that said, I'm working on a tool to hopefully make things better, for when you have to split repos or use other people's code: https://github.com/buildinspace/peru
Totally agree that arbitrarily splitting a repo into multiple repos is not necessarily good.
- Version control ambiguity: there is no single source of truth from which you could build the total project.
- Maintainability: you'd have to write documentation specifically to tell people how the projects relate to one another and how to compile and merge them together.
- Workflow: if you regularly make changes in one repo that you then deploy from another repo, you are continually having to copy builds from one project to another, which seems completely unnecessary compared to them being in the same repo.
One repository plus a Dockerfile providing a shared development environment that just works out of the box for the whole project saves A LOT OF TIME AND FRUSTRATION. The setup is also extremely easy for a new developer, who doesn't need to know many details about how everything works. Compare that with n repositories, where you need to know which version of x works with which version of y, put this folder here with these file permissions, and edit 10 configuration files just to reach the login page after a day of work, because, ah, this library is not available on Debian Wheezy.
I have started to feel the pain of multiple repos when it comes to code sharing. Since there are many different solutions with public code, you can never be 100% sure that some piece of code is not being used anymore. This creates severe maintainability issues, which goes along with what you mentioned about `evilMethod()`.
Git tends to experience performance problems when used with large repositories.
To quote Linus:
And git obviously doesn't have that kind of model at all. Git
fundamentally never really looks at less than the whole repo. Even if you limit things a bit (i.e. check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.
So **git scales really badly if you force it to look at everything as one huge repository**. I don't think that part is really fixable, although we can probably improve on it.
Emphasis mine. That's not to say your company's version control repository is "large," but this is one reason people tend to avoid large repositories within Git.
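When a large repository cannot be avoided, shallow clones are one common mitigation for the clone-time cost mentioned in the question; a small local demo (repo names made up):

```shell
# Build a throwaway repo with three commits, then shallow-clone it.
git init -q big
for i in 1 2 3; do
  echo "revision $i" > big/file.txt
  git -C big add file.txt
  git -C big -c user.name=demo -c user.email=demo@example.com \
      commit -q -m "rev $i"
done

# --depth 1 truncates history to the latest commit only.
git clone -q --depth 1 "file://$(pwd)/big" shallow
git -C shallow rev-list --count HEAD   # 1 -- only the tip commit was fetched
```

This helps casual contributors, but it does not change Git's repo-wide model for branches, tags, and commits.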
Thanks for the link ;) This is the reason why Google limits the code.google.com Git repositories to 4 GB. Cheers
It's a performance optimization, but I'm not sure if it's *only* a performance optimization. Per AProgrammer's answer it's also a way to represent the notion of a "module" in git.
To be clear here: a few quite big projects are implemented using Git as the source code repository (Android, Linux). That is solid confirmation that Git can be used with large repositories, though that DOES require additional techniques.
Git also lacks a number of features that make huge repositories workable. Partial checkouts work nicely in SVN, but the main supported way of working in Git is to check out an entire repository: not only all folders but all history. It's harder in Git to do things like sharing only one folder or using externals (which allow internal references to the same repo).
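For what it's worth, newer Git releases (2.25+) added `git sparse-checkout`, which narrows the working tree to chosen folders, although the clone still carries the full history; a small sketch with made-up directories:

```shell
# Create a repo with two top-level folders, then check out only one of them.
git init -q whole
mkdir -p whole/dirA whole/dirB
echo "a" > whole/dirA/a.txt
echo "b" > whole/dirB/b.txt
git -C whole add .
git -C whole -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "two folders"

git clone -q "file://$(pwd)/whole" partial
git -C partial sparse-checkout set dirA   # dirB disappears from the work tree
ls partial                                # dirA only; history is still complete
```

This is closer to SVN's partial checkout, but it is a working-tree filter, not the per-module history that SVN externals or CVS modules give you.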
They want [something that can] show their changes across all projects instead of trying to remember what project they made a change [to].
Sourcetree (a free-as-in-beer GUI Git frontend) allows you to register multiple repositories, organise them into logical groups, and then view status across all of them at once:
I am not affiliated with them in any way.
@CADbloke how is the interface for SmartGit nowadays? I remember a year or two back I was deciding between the two, and SourceTree was clearly easier to use than SmartGit, which was rather spartan and problematic; I couldn't even get my diff tool to work well.
I like it. SourceTree has gone off the rails in recent times. I paid for SmartGit & use it for Hg too. I like its diff tool. They are pretty similar, but I think SmartGit has made more progress lately.
Have you used Beyond Compare? I find that diff tool better than the others. How does it compare?
I recognize this is pretty old, but either Sourcetree has gone backwards or I just can't find that left-hand overview pane. I have the mid-left one (File status, Branches, etc.) that is per repo, just not the one for multiple repos. Any idea how I get it back (or is it no longer there)?
@JoelAZ They've since changed to a tabbed interface. If you click the '+' button in the tab bar, you'll get to what used to be the overview pane. I'm not convinced it's an improvement.
@RonMacNeil OH geez. Is THAT the equivalent of that pane? ffs. No not an improvement at all. Better to have left that pane, made it pinnable and/or slide/hideable imo if some ppl wanted the real estate back but burying it on another separate screen does not seem well thought out. Thx for replying.
TL;DR: the equivalent of a git repository is a CVS module, not a CVS repository.
CVS is designed with a notion of modules being a subdivision of a repository, and it is common to use CVS repositories with several modules having a quite independent life. As an example, it is easy to have branches specific to one module and not present in another.
git has not been designed with a notion of modules; each git repository is limited to one module in CVS terms. When you create a branch, it is valid for the whole repository.
Thus, if you want to import a CVS repository with several modules into git, you had better create a repository per module, especially if the modules have more or less independent lives and don't share things like branches and labels. (Due to the different usage patterns of branches in CVS and git, you may even investigate the usefulness of having one repository per CVS branch; but for a migration from CVS to git, your workflow will probably start out similar enough to a CVS workflow that it isn't worth the pain.)
If you're willing to play ball with them in order to appease, you could set it up this way. Or this method. Other than that, I think they're just expecting a single point of entry into the system to access assets.
Depending on access needs, separate Git repos might still be the better way to go, as "John Smith" may need access to certain data but not other data, while "Suzy Que" may be a sysadmin needing access to everything.
If you go with a single repo, you might run into issues with your internal access requirements. If it's an "everybody gets full access" type of thing, then I could possibly see their point of view.
At present, each developer has access to all source code within the unique CVS repository, so they do not care about developers having access to all source code within a unique Git repo. But this is a good argument for switching to multiple Git repositories, thanks ;-) However, I did not understand what you meant about branching models, as each application has its own branches/tags. Cheers
Hi Will. After reading your answer again I understand better what you mean (in the last two paragraphs). This is a good point worth underlining +1 Could you please explain your first paragraph more instead of using bare *this* links? Or you could remove your first paragraph... ;) Cheers
Git migration help page from Eclipse suggests to reorganize the CVS/SVN directory tree into multiple Git repositories:
Now is a great time to refactor your code structure. Map the current CVS/SVN directories, modules, plugins, etc. to their new home in Git. Typically, one Git repository (.git) is created for each logical grouping of code -- a project, a component, and so on.
The trade-off here is that each additional Git repository adds extra overhead to your development process - all Git commands and operations happen at the level of a single Git repository. On the flip side, each repository user will have an entire copy of the repository history, making very large repositories cumbersome to work with for casual contributors.
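The per-repository overhead mentioned in that quote is easy to picture: every Git command is scoped to a single repository, so routine operations over N repositories turn into loops (the layout below is made up):

```shell
# Two independent repositories standing in for two split projects.
mkdir -p repos
git init -q repos/app
git init -q repos/lib

# Even a simple "what changed everywhere?" must be asked once per repository.
for r in repos/*/; do
  echo "== $r"
  git -C "$r" status --short
done
```

Tools like AOSP's `repo` essentially package this kind of loop (`repo status`, `repo sync`) so developers don't have to script it themselves.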
Git operates on a whole tree at once, not on just the subdirectory you're on.
Let's say you have your project at
And let's say these two files have changed:
When you're in the project root and you do a git status, you'll see these files as having changed:
If you `cd` to the MoreStuff directory, though, will you see only the morestuff.txt file? No. You'll still see both files, listed relative to your position:
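A runnable sketch of that behavior, with a made-up layout mirroring the description:

```shell
# Set up a repo with two folders and commit both files.
git init -q project
mkdir -p project/Stuff project/MoreStuff
echo "stuff" > project/Stuff/stuff.txt
echo "more" > project/MoreStuff/morestuff.txt
git -C project add .
git -C project -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial"

# Modify both files, then ask for status from inside MoreStuff.
echo "changed" >> project/Stuff/stuff.txt
echo "changed" >> project/MoreStuff/morestuff.txt
git -C project/MoreStuff status --short
# Both files are listed, the one outside the folder as ../Stuff/stuff.txt
```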
As a result, if you lump all your projects together in one big ol' Git repository, then every time you go to check in, you'll have to pick from among changes in every project.
Now there could be ways to mitigate that; for instance, you can try to make sure you at least temporarily commit your changes before switching to work on a different project. But that's a lot of overhead that each person on your team would have to deal with, compared to simply doing it the right way: one Git repository per project.
Thanks for this argument, I agree ;) But most of the developers used to CVS do not agree; they want this behavior: they want WinCVS to show their changes across all projects instead of having to remember which project they made a change to (if you change several files in many directories for the same fix, you risk forgetting to commit one of them). The `repo` tool from AOSP allows mixing multiple Git repositories, and `repo status` shows changes across all the projects ;) Cheers