Mergiraf: a syntax-aware merge driver for Git

427 points by p4bl0 9 November 2024 | 87 comments

Comments

DarkPlayer 9 November 2024

Looking at the architecture, they will probably run into some issues. We are doing something similar with SemanticDiff [1] and also started out using tree-sitter grammars for parsing and GumTree for matching. Both choices turned out to be problematic.

Tree-sitter grammars are primarily written to support syntax highlighting and often take a best-effort approach to parsing. This is perfectly fine for syntax highlighting, since the worst that can happen is that a few characters are highlighted incorrectly. However, when diffing or modifying code you really want the code to be parsed according to the upstream grammar, not something that mostly resembles it. We are currently in the process of moving away from tree-sitter and instead using the parsers provided by the languages themselves where possible.

GumTree is good at returning a result quickly, but there are quite a few cases where it consistently returned bad matches for us, no matter how many follow-up papers with improvements we tried to implement. In the end we switched over to a Dijkstra-based approach that tries to minimize the cost of the mapping, which is more computationally expensive but gives much better results. Difftastic uses a similar approach as well.
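A minimal sketch of what "Dijkstra over a mapping cost" can look like, assuming flat token sequences rather than syntax trees and a unit cost for every insertion or deletion. This is not SemanticDiff's or Difftastic's actual implementation; `min_cost_mapping` and its cost model are hypothetical and only illustrate the idea of searching for the cheapest mapping.

```python
import heapq


def min_cost_mapping(old, new):
    """Find the cheapest way to map `old` onto `new` with Dijkstra's algorithm.

    A node (i, j) means old[:i] and new[:j] have already been consumed.
    Keeping an equal item costs 0; deleting or inserting an item costs 1.
    """
    start, goal = (0, 0), (len(old), len(new))
    best = {start: 0}
    previous = {}
    queue = [(0, start)]

    while queue:
        cost, (i, j) = heapq.heappop(queue)
        if (i, j) == goal:
            break
        if cost > best[(i, j)]:
            continue  # stale queue entry, a cheaper path was already found

        steps = []
        if i < len(old) and j < len(new) and old[i] == new[j]:
            steps.append((0, (i + 1, j + 1), ("keep", old[i])))
        if i < len(old):
            steps.append((1, (i + 1, j), ("delete", old[i])))
        if j < len(new):
            steps.append((1, (i, j + 1), ("insert", new[j])))

        for step_cost, node, action in steps:
            new_cost = cost + step_cost
            if new_cost < best.get(node, float("inf")):
                best[node] = new_cost
                previous[node] = ((i, j), action)
                heapq.heappush(queue, (new_cost, node))

    # Walk back from the goal to recover the sequence of actions.
    actions = []
    node = goal
    while node != start:
        node, action = previous[node]
        actions.append(action)
    actions.reverse()
    return best[goal], actions
```

Calling `min_cost_mapping(["a", "b", "c"], ["a", "c", "d"])` returns the total cost together with the list of keep/delete/insert actions. A real tree differ would use richer nodes and costs, which is where the extra computational expense comes from.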

[1]: https://semanticdiff.com/

Game_Ender 9 November 2024

The tool has an excellent architecture section [0] that goes into how it works under the hood. It stands out to me that a complex tool has an overview to this depth that allows you to grasp conceptually how it works.

0 - https://mergiraf.org/architecture.html

chrismorgan 9 November 2024

Going through the sorts of conflicts it solves, and limitations in that, I find it claiming that in some insertions, order doesn’t matter <https://mergiraf.org/conflicts.html#neighbouring-insertions-...>.

I really don’t like that. At the language level, order may not matter, but quite frequently in such cases the order does matter, insofar as almost every human would put the two things in a particular order; or where there is a particular convention active. If you automatically merge the two sides in a different order from that, doing it automatically has become harmful.

My clearest example: take Base `struct Foo; struct Bar;`, then between these two items, Left inserts `impl Foo { }`, Right inserts `struct Baz;`. To the computer, the difference doesn’t matter, but merging it as `struct Foo; struct Baz; impl Foo { } struct Bar;` is obviously bad to a human. This is the problem: it’s handling language syntax semantics, but can’t be aware of logical semantics. (Hope you can grasp what I’m trying to convey, not sure of the best words.) Left was not inserting something between Foo and Bar, it was attaching something to the end of Foo. Whereas Right was probably inserting something between Foo and Bar—but maybe even it was inserting something before Bar. You perceive that these are all different things, logically.
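To make the ambiguity concrete, here is a small sketch that enumerates both orderings a tool could pick when two sides insert at the same position. It is not Mergiraf's algorithm; `candidate_merges` is a hypothetical helper that treats each item as an opaque string.

```python
def candidate_merges(base, position, left_insert, right_insert):
    """Return both possible results of inserting two items at the same position.

    Syntactically the two results are equivalent when order does not matter,
    but only one of them may match what a human would have written.
    """
    head, tail = base[:position], base[position:]
    return [
        head + [left_insert, right_insert] + tail,
        head + [right_insert, left_insert] + tail,
    ]


base = ["struct Foo;", "struct Bar;"]
left_first, right_first = candidate_merges(base, 1, "impl Foo { }", "struct Baz;")
```

Here `left_first` keeps `impl Foo { }` directly after `struct Foo;`, while `right_first` is the ordering described above as obviously bad. Nothing in the syntax tells the tool which of the two the authors intended.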

Another example where this will quickly go wrong: in CSS rulesets, some will sort the declarations by property name lexicographically, some by property name length (seriously, it’s frequently so pretty), some will group by different types of property… you can’t know.

nathell 9 November 2024

‘Why the giraffe? Two reasons. First, it can see farther due to its height; second, it has one of the biggest hearts of all land mammals. Besides, its ossicones make you believe it listens to you when you look at it.’ – My NVC teacher

Kudos for the nonviolence. :)

lucasoshiro 9 November 2024

Happy to see something being developed for merge drivers. They are an underrated Git feature that could save a lot of trouble, since the standard three-way merge of file contents is not aware of the language and can create problems. For example, if you have this valid Python code:

x = input()
if x == 'x':
    print('foo')
    print('bar')

If you delete the first print in one branch, delete the other print in another branch, and then merge the two branches, you'll have this:

x = input()
if x == 'x':

Both branches delete a portion of the code inside the if block, leaving it with only whitespace. In Python this is not valid code, as empty blocks need to be declared with pass.
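The claim is easy to check with the standard `ast` module. The sketch below takes the merged text quoted above as given (it does not run Git or simulate the merge) and only tests whether each version parses; `is_valid_python` and the constant names are made up for illustration.

```python
import ast


def is_valid_python(source):
    """Return True if `source` parses as Python, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False
    return True


BASE = "x = input()\nif x == 'x':\n    print('foo')\n    print('bar')\n"
# One branch deletes the first print, the other deletes the second.
LEFT = "x = input()\nif x == 'x':\n    print('bar')\n"
RIGHT = "x = input()\nif x == 'x':\n    print('foo')\n"
# The merged result described above: the if block is left empty.
MERGED = "x = input()\nif x == 'x':\n"
# Adding an explicit pass makes the empty block valid again.
FIXED = MERGED + "    pass\n"
```

`is_valid_python` accepts `BASE`, `LEFT`, `RIGHT` and `FIXED` but rejects `MERGED`, so each side is fine on its own and only the combination breaks. A check like this could run after a merge to flag files that no longer parse.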

I installed Mergiraf to see if it can solve this situation, but sadly, it doesn't support Python...

erik_seaberg 9 November 2024

Syntax-aware tools always have issues when a team extends the base language to fit their problem. Rust has macros. People started using "go generate" for stuff like early generics. Does Mergiraf take EBNF or plugins or does a team fork it to explain their syntax?

leonheld 9 November 2024

I'll certainly give it a try. Another tool I've been using (with varying degrees of success) to enhance my git life is https://github.com/tummychow/git-absorb. If both of these worked flawlessly, or maybe were even officially incorporated into git, I'd be very happy.

IshKebab 9 November 2024

This sounds great. To be honest though none of the merge tools really give me enough information to resolve all conflicts easily.

The best I've got to is zdiff3 in VSCode (not using their fancy merge view which I don't understand at all). But it's missing:

1. Blame for the merge base.

2. Detection of the commit that introduced the first conflict.

3. Most annoyingly, no way to show diffs between the "current" and "incoming". IIRC it has buttons to compare both of those to the merge base, but not to each other. That often leaves me visually scanning the text to manually find differences like a neanderthal. Sometimes it's annoying enough that I copy & paste current/incoming into files and then diff those but that's a right pain.
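As a workaround for the third point, a short script can pull the two sides out of a conflict region and diff them directly. This is a minimal sketch, assuming a single well-formed conflict in the diff3/zdiff3 marker style (`<<<<<<<`, `|||||||`, `=======`, `>>>>>>>`); `split_conflict` and `diff_sides` are hypothetical helpers, not part of Git or VSCode.

```python
import difflib


def split_conflict(text):
    """Split one diff3/zdiff3-style conflict region into (ours, base, theirs)."""
    ours, base, theirs = [], [], []
    target = None
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            target = ours
        elif line.startswith("|||||||"):
            target = base
        elif line.startswith("======="):
            target = theirs
        elif line.startswith(">>>>>>>"):
            target = None
        elif target is not None:
            target.append(line)
    return ours, base, theirs


def diff_sides(text):
    """Return a unified diff between the "current" and "incoming" sides."""
    ours, _base, theirs = split_conflict(text)
    return list(
        difflib.unified_diff(ours, theirs, "current", "incoming", lineterm="")
    )


CONFLICT = """<<<<<<< HEAD
x = 1
y = 0
||||||| base
x = 0
y = 0
=======
x = 2
y = 0
>>>>>>> feature
"""
```

`diff_sides(CONFLICT)` yields the usual unified diff lines, with the differing `x` assignments marked `-` and `+`. Handling several conflicts per file or nested markers would need a more careful parser.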

jappgar 9 November 2024

This seems like a really cool idea that would help with a scenario I encounter a lot with conflicts related to auto-formatting. Sometimes a small change can lead to a lot of whitespace changes below (in functional chains in js, for example).

Can this also detect some scenarios where semantic conflicts (but not line conflicts) arise, usually due to moved code?

I don't know the exact circumstances when this happens, but occasionally you can end up with, e.g., a function defined twice after two branches both move the same function elsewhere.

manx 10 November 2024

Great to see more work in automatic git conflict resolution! I'll definitely give this a try. My own attempt at such a tool involved character based diffing and patching. It is able to solve most trivial conflicts that git cannot: https://fdietze.github.io/blend/

I never found the time to make a cli out of this.

_flux 9 November 2024

Python support would certainly seem useful for this, in particular as its indentation-based AST should play nicely with this.

donatj 9 November 2024

Neat idea for sure. Language support is pretty limited right now, hopefully there's support for more in the works.

fuzzy2 9 November 2024

I'm eager to try this. Seems like it could revive the genre after Semantic Merge died.

DrBenCarson 9 November 2024

How is this better than Difftastic? https://github.com/Wilfred/difftastic

ctenb 9 November 2024

Would it be possible to make this work with a treesitter grammar?

cool-RR 9 November 2024

I'm flummoxed at the lack of Python support.

jay-anderson 9 November 2024

Nice to see lilypond in the example.

pknopf 9 November 2024

Can LLMs help here?