Re: PostgreSQL Developer meeting minutes up - Mailing list pgsql-hackers

From	Markus Wanner
Subject	Re: PostgreSQL Developer meeting minutes up
Date	May 29, 2009 03:41:13
Msg-id	20090529084109.14871sskioiu9gud@mail.bluegap.ch
In response to	Re: PostgreSQL Developer meeting minutes up (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: PostgreSQL Developer meeting minutes up
List	pgsql-hackers

Hi,

Quoting "Robert Haas" <robertmhaas@gmail.com>:
> That's not the best news I've had today...

Sorry :-(

> To me they sound complex and inconvenient.  I guess I'm kind of
> mystified by why we can't make this work reliably.  Other than the
> "broken tags" issue we've discussed, it seems like the only real issue
> should be how to group changes to different files into a single
> commit.  Once you do that, you should be able to construct a
> well-defined, total function f : <cvs-file, cvs-revision> -> <git
> commit> which is surjective on the space of git commits.  In fact it
> might be a good idea to explicitly construct this mapping and drop it
> into a database table somewhere so that people can sanity check it as
> much as they wish.  Why is this harder than I think it is?

Well, as CVS doesn't guarantee any consistency between files, you end
up with silly situations more often than you think. One of the
simplest possible examples is something like:

  commit 1: fileA @ 1.1, fileB @ 1.2
  commit 2: fileA @ 1.2, fileB @ 1.1

Seen from fileA, it's obvious that commit 1 (@1.1) comes before commit
2 (@1.2), but seen from fileB it's the exact opposite. The most
promising approach to solve these problems seems to be based on Graph
Theory, where you work with a graph of dependencies from fileA @ 1.1
to fileA @ 1.2.
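
To make the graph view concrete, here is a minimal Python sketch. It is
not the code of cvs_import or cvs2svn; the blob representation, the
function names and the trunk-only revision tuples are all assumptions
made for illustration. It derives "earlier -> later" edges between
blobs from the per-file revision order and checks for a cycle:

```python
from collections import defaultdict

# Hypothetical representation: blob name -> {file: revision}.
# Revisions are tuples, e.g. (1, 2) for 1.2 (trunk revisions only).
blobs = {
    "commit 1": {"fileA": (1, 1), "fileB": (1, 2)},
    "commit 2": {"fileA": (1, 2), "fileB": (1, 1)},
}

def dependency_edges(blobs):
    """Return (earlier, later) edges implied by each file's revision order."""
    by_file = defaultdict(list)
    for name, revs in blobs.items():
        for path, rev in revs.items():
            by_file[path].append((rev, name))
    edges = set()
    for entries in by_file.values():
        entries.sort()
        for (_, earlier), (_, later) in zip(entries, entries[1:]):
            if earlier != later:
                edges.add((earlier, later))
    return edges

def has_cycle(nodes, edges):
    """Depth-first search for a cycle in the blob dependency graph."""
    succ = defaultdict(list)
    for a, b in edges:
        succ[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    state = {n: WHITE for n in nodes}

    def visit(n):
        state[n] = GREY
        for m in succ[n]:
            # A GREY successor is on the current DFS path: cycle found.
            if state[m] == GREY or (state[m] == WHITE and visit(m)):
                return True
        state[n] = BLACK
        return False

    return any(state[n] == WHITE and visit(n) for n in nodes)

edges = dependency_edges(blobs)
cyclic = has_cycle(blobs, edges)
print(sorted(edges), cyclic)
```

For the two commits above, fileA yields the edge commit 1 -> commit 2
and fileB yields commit 2 -> commit 1, so the check reports a cycle.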

To resolve the above situation, you'd have to "split" a blob of
single-file commits into two end-result commits (for monotone / git).
In the above example, you'd have two options to resolve the conflict:

  commit 1a: fileA @ 1.1
  commit 2:  fileA @ 1.2, fileB @ 1.1
  commit 1b: fileB @ 1.2

Or:

  commit 2a: fileB @ 1.1
  commit 1:  fileA @ 1.1, fileB @ 1.2
  commit 2b: fileA @ 1.2

(Note that often enough, these have actually been separate commits in
CVS as well; there's just no way to represent that. And no, timestamps
are simply not reliable enough.)
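
As a rough illustration of splitting, the following Python sketch moves
one file out of a blob into its own single-file blob and then asks the
standard library's graphlib (Python 3.9+) for a consistent order. The
helper names order() and split() and the blob layout are hypothetical;
real converters use far more elaborate heuristics to decide where to
split.

```python
from graphlib import TopologicalSorter, CycleError

def order(blobs):
    """Order blobs so that each file's revisions appear in increasing order.

    Raises CycleError if no such order exists.
    """
    ts = TopologicalSorter()
    by_file = {}
    for name, revs in blobs.items():
        ts.add(name)
        for path, rev in revs.items():
            by_file.setdefault(path, []).append((rev, name))
    for entries in by_file.values():
        entries.sort()
        for (_, earlier), (_, later) in zip(entries, entries[1:]):
            ts.add(later, earlier)  # 'later' depends on 'earlier'
    return list(ts.static_order())

def split(blobs, name, path):
    """Split blob `name` in two: '<name>a' keeps the rest, '<name>b' gets `path`."""
    result = {k: dict(v) for k, v in blobs.items() if k != name}
    rest = dict(blobs[name])
    moved = {path: rest.pop(path)}
    result[name + "a"] = rest
    result[name + "b"] = moved
    return result

# Hypothetical input: revisions as tuples, e.g. (1, 2) for 1.2.
blobs = {
    "commit 1": {"fileA": (1, 1), "fileB": (1, 2)},
    "commit 2": {"fileA": (1, 2), "fileB": (1, 1)},
}

try:
    order(blobs)
    conflict = False
except CycleError:
    conflict = True

option_1 = order(split(blobs, "commit 1", "fileB"))
option_2 = order(split(blobs, "commit 2", "fileA"))
print(conflict)
print(option_1)
print(option_2)
```

The unsplit input is rejected as cyclic. Splitting fileB out of commit 1
gives the order commit 1a, commit 2, commit 1b; splitting fileA out of
commit 2 gives commit 2a, commit 1, commit 2b.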

Now add tags, branches and cyclic dependencies involving many files
and many hundreds of commits to the example above and you start to get
an idea of the complexity of the problem in general.

See my description and diagrams of the steps used for cvs_import in
monotone at [1] or follow descriptions of how cvs2svn works internally.

A few numbers about a conversion I'm trying for testing my algorithm
and heuristics. It's converting a pretty recent snapshot of the
Postgres repository:

 * running at 100% CPU time since: April 17
 * total number of files involved: 6'847
 * total number of blobs (before splitting): 28'010
 * blobs split due to cyclic dependencies: 12'801

Admittedly, my algorithm isn't optimized at all. However, I'm focusing
on good results rather than speed of conversion.

Also note that monotone uses SQLite, so it actually stores the
results of this conversion in an SQL database, as you proposed.
Recently, a git_export command has been added, so that's definitely
worth a try for converting CVS to git. However, I fear cvs2git is more
mature.
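
For the sanity-check table you proposed, something along these lines
might do. This is only a sketch using Python's sqlite3 module; the table
and column names are invented for illustration and are not monotone's
actual schema. The primary key on (cvs_file, cvs_revision) makes the
mapping well-defined, and a simple query finds commits that no CVS
revision maps to, i.e. violations of surjectivity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (id TEXT PRIMARY KEY);
-- Hypothetical mapping table: one row per (file, revision) pair.
CREATE TABLE cvs_to_commit (
    cvs_file     TEXT NOT NULL,
    cvs_revision TEXT NOT NULL,
    commit_id    TEXT NOT NULL REFERENCES commits(id),
    PRIMARY KEY (cvs_file, cvs_revision)
);
""")

conn.executemany("INSERT INTO commits VALUES (?)",
                 [("1a",), ("2",), ("1b",)])
conn.executemany("INSERT INTO cvs_to_commit VALUES (?, ?, ?)", [
    ("fileA", "1.1", "1a"),
    ("fileA", "1.2", "2"),
    ("fileB", "1.1", "2"),
    ("fileB", "1.2", "1b"),
])

# Surjectivity check: commits that no CVS revision maps to.
unmapped = [row[0] for row in conn.execute("""
    SELECT id FROM commits
    WHERE id NOT IN (SELECT commit_id FROM cvs_to_commit)
    ORDER BY id
""")]

# Well-definedness check: a second mapping for the same revision is rejected.
try:
    conn.execute("INSERT INTO cvs_to_commit VALUES ('fileA', '1.1', '2')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(unmapped, duplicate_rejected)
```

Checking totality would additionally need a list of every revision in
the CVS repository to compare against, which this sketch leaves out.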

Regards

Markus Wanner

[1]: a description of the various steps in conversion from CVS to monotone:
http://www.monotone.ca/wiki/CvsImport/
