Re: A slight regression in 2.9.1 text:p loading...

From: Ben Martin <monkeyiq_at_users.sourceforge.net>
Date: Wed Jul 20 2011 - 07:53:32 CEST

On Tue, 2011-07-19 at 10:15 +1000, Martin Sevior wrote:
> Hi Ben,
>
> Thanks very much for this clear investigation of a very convoluted spec!

I'm now clarifying on the OASIS list. Hopefully I'll have full clarity
soon :)

FWIW the following fragment when loaded by OpenOffice 3.3.0 only shows
one space between "foo" and "bar". Logically, when saved this one space
that is retained is the one in the text:span and it seems the second
space just before the "bar" is the one that gets the axe.

<text:p text:style-name="Standard">Hi there <text:span
text:style-name="T1">foo </text:span> bar</text:p>

On the other hand, abiword trunk (and 2.9.1) retains both spaces giving
"Hi there foo bar"

>
> Just one point about your commit here:
>
> + UT_UCS4String s = ODi_textp_fold_whitespace( pBuffer, length );
> + s = ODi_textp_compact_two_or_more_spaces( s );
> + if(!m_bContentWritten)
> {
> - printf(">>> content written!\n");
> - m_charData += UT_UCS4String (pBuffer, length, true);
> + s = ODi_textp_trim_whitespace_leading( s );
> }
> + m_charData += s;
> +
> + printf(">>> content written!\n");
> + UT_DEBUGMSG(("charData() s:%s\n", s.utf8_str() ));
> }
>
> lets, turn that printf(">>> content written \n");

oh, I committed this tiny udpate too btw.

>
> into a UT_DEBUGMSG so we don't put debug info into a users terminal
> while we're loading an odt doc!
>
> I should never have left that code there :-(
>
> Cheers
>
> Martin
>
> On Tue, Jul 19, 2011 at 9:58 AM, Ben Martin
> <monkeyiq@users.sourceforge.net> wrote:
> > Hmm, one flaw to my flush() plan is that flush() is called to
> > appendSpan() during a load. In reality it might not matter so much
> > because the pathological case that I see causing problems might be so
> > rare.
> >
> > The problem case would be with
> > <text:p>foo <text:span>bar </text:span> bzzzzt</text:p>
> >
> > where the spaces after bar and before bzzzzt seem to require being
> > coalesced to a single space. Point 2 of the ODF spec appears to
> > concatenate the text of all text:p, Point 3 trims leading/trailing
> > whitespace, Point 4 then coalesces 2+ spaces to 1 in the content.
> >
> > If one coalesces the intermediate spaces in the above, it does raise the
> > question of where the remaining space should be placed (inside or
> > outside the text:span close). So maybe I'm reading point 2 slightly
> > wrong.
> >
> > My current fix which I think I'll commit shortly does much the same as
> > Martin's code but also performs newline and space coalescing for the
> > initial call to charData(). The doc from 11847 seems to render the same
> > for my code and 2.9.1.
> >
> >
> >
> > On Mon, 2011-07-18 at 18:31 +1000, Ben Martin wrote:
> >> Hi Martin,
> >>
> >> Unsurprisingly, this is actually a bit more complex than I originally
> >> thought :) I was reading over the part of the ODF spec I cited, and
> >> traces of charData() calls, and the bug that ran text together when
> >> whitespace is eaten 12813.
> >>
> >> A fairly simple fragment created with OO is the following, which should
> >> all be on one line.
> >>
> >> <text:p text:style-name="Standard">
> >> Hello <text:span text:style-name="T1">there</text:span> you
> >> </text:p>
> >>
> >> In this case one believes that the spaces are significant. Abiword's
> >> charData might be fed these during a load:
> >> "Hello "
> >> "there"
> >> " you"
> >>
> >> The spec goes:
> >> 2) The character data of the paragraph element and of all descendant
> >> elements for which the OpenDocument schema permits the inclusion of
> >> character data for the element itself and all its ancestor elements up
> >> to the paragraph element, is concatenated in document order.
> >>
> >> 3) Leading " " (U+0020, SPACE) characters at the start of the resulting
> >> text and trailing SPACE characters at the end of the resulting text are
> >> removed.
> >>
> >> So it would seem that accruing m_charData in the charData() method and
> >> treating the removal of leading and trailing whitespace in flush() might
> >> be more appropriate. The charData() method should probably fold anything
> >> in
> >> Z = { U+0009, U+000D, U+000A )
> >> to the SPACE = U+0020 char as per the spec, and maybe collapse
> >> contiguous sequences of 2+ spaces to 1 space too.
> >>
> >> The folding of "other" whitespace into spaces and the collapsing must be
> >> done in charData() needed so that "text:line-break" and "text:s" code
> >> can still simply work on m_charData. If folding/collapsing was delayed
> >> to flush() then those particular cases would have their effect ignored.
> >>
> >> This does leave the sticky case of a text:s at the start of a text:p.
> >> The spec makes it hard to tell if one or more text:s at the beginning of
> >> a text:p should actually stick around or not. It seems not, looking at
> >> the following quote from the spec 6.1.3:
> >> The <text:s> element is used to represent
> >> the [UNICODE] character " " (U+0020, SPACE).
> >>
> >> If this was a special case for leading whitepsace in a text:p then I
> >> would imagine that 6.1.2 would explicitly mention it.
> >>
> >> Of course, once I'm somewhat happy I'll check 11847 against 2.9.1 and my
> >> modified trunk.
> >>
> >> On Mon, 2011-07-18 at 11:04 +1000, Martin Sevior wrote:
> >> > Hi Ben,
> >> >
> >> > Yes, I did write that code! it was revision 29316 with the commit comment:
> >> >
> >> > "Properly fix bug 11847 Spaces inserted before paragraphs."
> >> >
> >> > So when you do your fixes, check that bug 11847 remains fixed :-)
> >> >
> >> > Cheers
> >> >
> >> > Martin
> >> >
> >> > On Mon, Jul 18, 2011 at 10:50 AM, Martin Sevior <msevior@gmail.com> wrote:
> >> > > Hi Ben,
> >> > >
> >> > > This seems vaguely familiar to me but revision 29384 appears t be a
> >> > > commit by uwog that cleans up a bit of code. As far as I can tell
> >> > > if(!m_bContentWritten) {...} was present before 29384.
> >> > >
> >> > > I think I might have put that code in place but I can't recall the
> >> > > reason. I'll do a bit of digging with viewVC
> >> > >
> >> > > Cheers
> >> > >
> >> > > Martin
> >> > >
> >> > >
> >> > > On Sun, Jul 17, 2011 at 4:58 PM, Ben Martin
> >> > > <monkeyiq@users.sourceforge.net> wrote:
> >> > >> Hi,
> >> > >> I dug into the loading of hand crafted ODT files in 2.9.1 and have
> >> > >> filed a bug 13112 [1]. I've been looking into fixing the bug, but also
> >> > >> don't want to step on anyone's toes in doing so.
> >> > >>
> >> > >> The gist of what I see being the problem is a special case in void
> >> > >> ODi_TextContent_ListenerState::charData() where
> >> > >> if(!m_bContentWritten) {...} is called. I'm not sure what use case
> >> > >> prompted this special case to be added, so I don't want to blindly
> >> > >> change it without discussion first.
> >> > >>
> >> > >> It seems from "6.1.2 White Space Characters" of the spec [2], in
> >> > >> particular page 120 of [3], that leading and trailing whitespace is to
> >> > >> be removed, and internal whitespace is to be collapsed. ie, a stream of
> >> > >> 2+ SPACE to be replaced with a single SPACE.
> >> > >>
> >> > >> Prior to collapsing anything in
> >> > >> Z = { U+0009, U+000D, U+000A ) is first replaced with
> >> > >> SPACE = U+0020.
> >> > >>
> >> > >> The current code, when given something like the following will preserve
> >> > >> the newline after magic;
> >> > >>
> >> > >> <text:p>
> >> > >> I am a magic
> >> > >> and special wizard<text:span>....
> >> > >> </text:p>
> >> > >>
> >> > >> This is because the entire string "\n I am a magic\n and special
> >> > >> wizard" will be passed to charData() on the first call when
> >> > >> m_bContentWritten == false;
> >> > >>
> >> > >> FWIW the special case was brought in at revision 29384.
> >> > >>
> >> > >>
> >> > >> [1] http://bugzilla.abisource.com/show_bug.cgi?id=13112
> >> > >> [2] http://docs.oasis-open.org/office/v1.2/cos01/
> >> > >> [3]
> >> > >> http://docs.oasis-open.org/office/v1.2/cos01/OpenDocument-v1.2-cos01-part1.pdf
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >
> >
>

Received on Wed Jul 20 07:53:48 2011

This archive was generated by hypermail 2.1.8 : Wed Jul 20 2011 - 07:53:48 CEST