mk47at: Which programming language do you use? (I've written a magog db diff script in Python this morning. Maybe I've got something that could be useful...)
I'm using Python as well (v2.7). Getting the differences between changelogs was easy: a new version of a changelog always contains the old version as a suffix, so it's only a matter of starting from the end, finding where the two strings begin to differ, and taking the new string up to that point as the changelog diff.
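As a rough illustration of that suffix comparison, here is a minimal sketch. The function name changelog_diff is made up for this example, and it assumes the old changelog really is a suffix of the new one:

```python
def changelog_diff(old, new):
    """Return the part of `new` that precedes the suffix it shares with `old`."""
    # Walk backwards from the end of both strings while the characters agree.
    shared = 0
    limit = min(len(old), len(new))
    while shared < limit and old[-1 - shared] == new[-1 - shared]:
        shared += 1
    # Everything in `new` before the shared suffix is the newly added text.
    return new[:len(new) - shared]
```

Calling changelog_diff(old_changelog, new_changelog) would then give only the entries added at the top of the new version.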
Finding additions and removals inside a text is proving to be much more difficult. The most promising solution I found was the second answer in this Stack Overflow question, which uses difflib.SequenceMatcher. I modified it a little into this:
------------------------------------
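import difflib
s1 = 'whatever1'
s2 = 'whatever2'
def isjunk(string):
...."Return True if we don't care about this string"
....return string == ' '
# isjunk receives one element of the sequence at a time (here, a single character)
s = difflib.SequenceMatcher(isjunk)
s.set_seqs(s1, s2)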
removed = ''
added = ''
for (opcode, s1start, s1end, s2start, s2end) in s.get_opcodes():
....if opcode == 'equal':
........removed += s1[s1start:s1end]
........added += s2[s2start:s2end]
....elif opcode == 'insert':
........added += '[bbb]' + s2[s2start:s2end] + '[/bbb]'
....elif opcode == 'delete':
........removed += '[bbb]' + s1[s1start:s1end] + '[/bbb]'
....elif opcode == 'replace':
........removed += '[bbb]' + s1[s1start:s1end] + '[/bbb]'
........added += '[bbb]' + s2[s2start:s2end] + '[/bbb]'
------------------------------------
(Dots added to preserve indentation, and [bbb] BBCode tags used instead of the regular bold tag to prevent them being interpreted as bold text here in the forum)
It's supposed to generate two new strings (removed and added), which respectively have in bold type whatever appears in s1 but not in s2, and whatever appears in s2 but not in s1.
As I said, this works pretty well for certain strings. E.g., the languages change in the This War of Mine: The Little Ones update here:
------------------------------------
s1 = 'Text only: Brazilian-Portuguese, German, English, Spanish, French, Italian, Japanese, Korean, Polish, Russian, Turkish'
s2 = 'Text only: Brazilian-Portuguese, Chinese, German, English, Spanish, French, Italian, Japanese, Korean, Polish, Russian, Turkish'
removed = 'Text only: Brazilian-Portuguese, German, English, Spanish, French, Italian, Japanese, Korean, Polish, Russian, Turkish'
added = 'Text only: Brazilian-Portuguese, Chinese, German, English, Spanish, French, Italian, Japanese, Korean, Polish, Russian, Turkish'
------------------------------------
OK, the addition of the Chinese language has been correctly detected. However, taking the language strings in the Master of Orion: Collector's Edition [In Dev] update here:
------------------------------------
s1 = 'Audio and text: English, Russian. Text only: Czech, German, Spanish, French, Japanese, Korean, Polish, Portuguese, Turkish'
s2 = 'Audio and text: English, Russian. Text only: Brazilian-Portuguese, Czech, German, Spanish, French, Japanese, Korean, Polish, Turkish'
removed = 'Audio and text: English, Russian. Text only: Czech, German, Spanish, French, Japanese, Korean, Polish, Portuguese, Turkish'
added = 'Audio and text: English, Russian. Text only: Brazilian-Portuguese, Czech, German, Spanish, French, Japanese, Korean, Polish, Turkish'
------------------------------------
Fail... Of course, this is strictly correct (i.e. by replacing what's bolded in 'removed' with what's bolded in 'added' you'd obtain the s2 string), but that's not what I want: the words get split into fragments such as 'Braz' and 'ilian'. I guess I need to tell the SequenceMatcher to only consider whole words, but I don't know how... (in fact I thought the 'isjunk' function served that purpose)
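One possible way to get whole-word matching, offered only as a sketch: SequenceMatcher accepts any sequences of hashable items, not just strings, so it can be handed lists of words instead of raw characters. The function name word_diff and its tag parameters are made up for this example, and it assumes that splitting on whitespace is good enough (runs of whitespace are collapsed to single spaces in the output, and punctuation stays attached to the word it follows):

```python
import difflib

def word_diff(s1, s2, open_tag='[bbb]', close_tag='[/bbb]'):
    """Mark whole words that differ between s1 and s2."""
    # Compare lists of whitespace-separated words instead of raw characters.
    a, b = s1.split(), s2.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    removed, added = [], []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == 'equal':
            removed.extend(a[a1:a2])
            added.extend(b[b1:b2])
            continue
        # 'delete' and 'replace' cover words of s1; 'insert' and 'replace' cover words of s2.
        if a2 > a1:
            removed.append(open_tag + ' '.join(a[a1:a2]) + close_tag)
        if b2 > b1:
            added.append(open_tag + ' '.join(b[b1:b2]) + close_tag)
    return ' '.join(removed), ' '.join(added)
```

With the Master of Orion strings above, removed, added = word_diff(s1, s2) should tag 'Portuguese,' as a whole word in removed and 'Brazilian-Portuguese,' as a whole word in added. As for isjunk, it only tells the matcher which elements to ignore when looking for matches; it does not change what counts as an element.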