overview -- revisions and the piece table

From: Paul Rohr (paul@abisource.com)
Date: Sat May 18 2002 - 13:40:39 EDT

  • Next message: Paul Rohr: "design q -- are revisions linear or parallel?"

    While I haven't had time to carefully read and think through all the design
    work going on in the "revision marks" threads, I'm happy to see that people
    are seriously engaging with the complexity of the problem.

    The following conceptual overview of how the piece table works might be
    helpful for folks who aren't familiar with it.

    piece table concepts
    --------------------
    If you dig into the implementation of the piece table, you'll note that it
    includes:

      I. the entire document as originally loaded,
      II. all new content & formatting added during the edit session,
      III. an audit history of all the changes, in order, and
      IV. bookkeeping information to construct a "current" view of the document.

    Of these, the rest of AbiWord spends most of its time dealing with IV. III
    is created in response to specific edit methods, and is also somewhat
    visible via the undo/redo mechanism that allows you to replay them. By
    comparison, the first two are pretty low-level and the rest of the AbiWord
    never sees them.

    For example, the drawing code just references IV, as do the exporters.

    revision marks
    --------------
    AFAICT, the most useful thing about revision marks is the ability to tell
    the difference between two states of a document:

      - what it was like when someone loaded it, and
      - what it was like when they saved it.

    In short, while you do want to know overall what the diff between those two
    states was -- content that got added, deleted, moved, or reformatted by a
    specific contributor -- you *don't* need to know what all of the specific
    typing glitches in between were. For example, if an editor retyped a
    sentence fifteen different ways before saving it, that's just not useful
    information. [Plus which, it's a pain in the ass to keep track of it.]

    the insight
    -----------
    If you look at how the piece table is constructed in memory, there are
    basically five chunks of memory.

      - a contiguous buffer of the original document's text (read-only)
      - a contiguous buffer of all new text typed (append-only)
      - the original document's properties (again, read-only)
      - new properties added (again, append-only)
      - a heck of a lot of bookkeeping to tell which parts of each are being
        used in the current version.

    In short, I believe that this is enough information to detect all of the
    relevant changes at export time.

    caveat
    ------
    The following example just shows how to detect text-level changes. There's
    definitely an additional level of complexity needed to capture formatting
    changes, too.

    (Indeed, FWIW, I sometimes think that we should take advantage of the fact
    that our file format is text-based, and use a tailored XML-aware version of
    diff to handle all of the low-level work here.)

    BTW, *do* people use revision marks to keep track of formatting changes too?
     What does the UI look like?

    for example
    -----------
    Say that the original document was:

      The quick brown fox jumped over the lazy dog.

    After a few cuts and pastes, plus a little typing, the new document reads:

      The lazy green fox slept next to the quick dog.

    In short, the text stored in the piece table can be divided into the
    following five equivalence classes. The first three subdivide the content
    from the original document as follows:

    1. text that didn't change

      The quick brown fox jumped over the lazy dog.
      ^^^^ ^^^^ ^^^^ ^^^^

    2. text that's no longer used

      The quick brown fox jumped over the lazy dog.
                ^^^^^^ ^^^^^^^^^^^^

    3. text that got moved:

      The quick brown fox jumped over the lazy dog.
          ^^^^^ ^^^^

    The other two equivalence classes come from the "other half" of the piece
    table, namely:

    4. new text that got used

      sloppieysleppt on top of ofver greenmowvemauve next to
              ^^^^ ^ ^^^^^ ^^^^^^^

    5. new text that got abandoned

      sloppieysleppt on top of ofver greenmowvemauve next to
      ^^^^^^^^ ^ ^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^

    Yes, that's actually what the piece table looks like at the lowest levels.
    [Indeed, it's usually even messier.] There are two big streams of text:

      - one as imported
      - one of all the characters typed in this edit session

    All of the rest of the PT machinery just assembles together the relevant
    pieces of the text as needed. It's a really, really cool data structure.

    so what?
    --------
    Take a look at the five equivalence classes in the above example. When we
    currently persist the file format, the piece table extracts an ordered
    sequence of fragments of text that corresponds to the following equivalence
    classes:

      1. unchanged text
      3. moved text
      4. new text

    Even if we're supporting revision marks, there's still no reason to export
    the following equivalence class:

      5. new-and-abandoned

    However, we *will* want to export the other four, and differentiate them
    accordingly:

      1. unchanged text
      2. deleted text
      3. moved text
      4. new text

    My hope is that when revision tracking is turned on, we can algorithmically
    identify which equivalence class a given frag is in, and then emit and flag
    them, in order, at export time. For example, the following are easy:

      #4 is used, and comes from the append buffer
      #1 is used, and comes from the read-only buffer
      #2 is *not* used, and comes from the read-only buffer

    The first two (#4 and #1) are easy to detect and sequence properly. #2 can
    be inferred (by a brute force scan, if needed), although getting them
    ordered properly will be more work.

    Thus, the only screw case will be #3. As with #1 and #4, sequencing them is
    easy -- that's what the PT is already doing for us -- but it's not
    immediately obvious what the best way is to differentiate #3 from #1.
    Export-time alternatives might include:

      - scanning change records,
      - maintaining a stack to detect reordered frags, or
      - something more clever.

    To be honest, I forget how Jeff and I actually coded the copy/paste logic.
    It very well may be that equivalence class #3 gets decomposed into a pair of
    add/delete operations. If so, then this isn't even an issue.

    next steps
    ----------
    The above explanation just addresses whether and how to detect changes
    between the version loaded and the version about to be saved. It does NOT
    address:

      - how that should be reflected in the file format, NOR
      - how small a subset of CVS's branching & merging features to reinvent.

    As we've seen, the latter can quickly get way too complex. Still, I hope
    that people find this a useful starting point.

    Enjoy!

    Paul



    This archive was generated by hypermail 2.1.4 : Sat May 18 2002 - 13:43:12 EDT