XML parser timeline and complexity
Fairlight
fairlite at fairlite.com
Sat Feb 11 04:34:03 PST 2006
Boy. When I looked at doing a generic XML parser, I looked at a few ways
of doing it. Mostly, since I would be doing it from processing, I figured
on "reproducing" the structure of a DTD, for instance, in a lookup table's
records, and specifying a relation between tags (i.e., parent names would be
included, and there would be an attribute field YESNO, etc.).
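To make the lookup-table idea concrete, here is a rough sketch in Python rather than filePro processing. The table layout, the tag names (order, item, note) and the function name check_structure are all made up for illustration; a real table would be generated from the DTD rather than typed in by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: one record per tag, listing the parents it may
# appear under (None means "document root") and a YES/NO attribute flag.
TAG_RULES = {
    "order": {"parents": {None}, "attrs": True},
    "item": {"parents": {"order"}, "attrs": True},
    "note": {"parents": {"order", "item"}, "attrs": False},
}


def check_structure(xml_text, rules=TAG_RULES):
    """Return a list of structural problems; an empty list means none found."""
    problems = []

    def walk(elem, parent_tag):
        rule = rules.get(elem.tag)
        if rule is None:
            problems.append(f"unknown tag <{elem.tag}>")
        else:
            if parent_tag not in rule["parents"]:
                problems.append(f"<{elem.tag}> not allowed under {parent_tag!r}")
            if elem.attrib and not rule["attrs"]:
                problems.append(f"<{elem.tag}> may not carry attributes")
        # Keep descending even under an unknown tag so every problem is reported.
        for child in elem:
            walk(child, elem.tag)

    walk(ET.fromstring(xml_text), None)
    return problems
```

Calling check_structure('<order id="1"><item><note/></item></order>') returns an empty list. This only covers parent/child relations and the attribute flag; ordering, cardinality and data types are not checked at all.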
I've always used DTD and have resisted delving into Schema. I'm doing a
good bit of reading on it right now, and boy, did they make things far more
complex than necessary. The W3C may have ratified the standard, but only
M$ could come up with something -this- convoluted. True, it has more
power than DTD. But at what -cost-? My God.
I feel safe in saying that my 180 hour estimate for doing a whole generic
import parser was WAYYYYY optimistic. I can also easily see, given the
permutations involved to TRULY do a generic parser that will actually
validate structure against both DTD and Schema, much less import anything,
it could take a year or more.
Parsing syntactically is the -easy- part. The structural relation checking
is going to be a royal PITA.
Sadly, I may end up doing this anyway, as someone wants basically OneGate
but specifically for SOAP. And they kinda want the generic XML parser as
well. But just looking through not even the complete spec but a fair
amount of examples at a few sites, it's evident that doing structure
and data validation is going to be a complete PITA. No doubt in my mind.
And -then- you get people that only give you XML and don't bother including
a Schema -OR- a DTD (or even a reference to one or the other) in their
XML...they just "assume" that both/all parties have mutually agreed on
something and entirely leave the definition out of the document. I don't
think that should actually be accommodated. Include the spec, or no go.
It's the only sane way to fly.
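The "include the spec, or no go" rule is at least cheap to enforce up front. A minimal sketch, assuming Python's standard expat bindings: it looks for a DOCTYPE declaration or an xsi:schemaLocation / xsi:noNamespaceSchemaLocation attribute on the root element. The function name declared_definition is hypothetical, and it only detects that a definition is referenced; it does not fetch or apply it.

```python
import xml.parsers.expat

XSI = "http://www.w3.org/2001/XMLSchema-instance"
# With namespace_separator=" ", expat reports namespaced attribute names
# as "<namespace URI> <local name>".
SCHEMA_ATTRS = {XSI + " schemaLocation", XSI + " noNamespaceSchemaLocation"}


def declared_definition(xml_text):
    """Return 'dtd', 'schema', or None depending on what the document declares."""
    found = {"dtd": False, "schema": False, "seen_root": False}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["dtd"] = True

    def on_start(name, attrs):
        # Only the root element is inspected for schema hints.
        if not found["seen_root"]:
            found["seen_root"] = True
            found["schema"] = any(key in SCHEMA_ATTRS for key in attrs)

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)

    if found["dtd"]:
        return "dtd"
    if found["schema"]:
        return "schema"
    return None
```

An importer could reject anything where this returns None before doing any further work.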
I can fully believe a year long dev curve, especially if it's one person
coding it, maybe two, and especially if it's not the only thing on their
plate--as it obviously won't be.
Sure, it -seems- simple. And the stuff is incredibly easy to author on the
creation side. Parsing it -structurally- in addition to syntactically is
the real and utter timesink. I mean, if you know the syntax, writing the
actual data output in the proper structure is trivial. But parsing
it...that's kind of like teaching the computer how to think, to a degree.
It's not AI, but it can be a set of pretty complex relationships to "teach"
the computer how to reconfigure itself to honour.
I don't envy fP-Tech this one...assuming fPXML actually includes an import
parser and isn't just an export expeditor. Still no response from Bud,
publicly or privately, despite having CC'd him the post. I'm shocked! :)
But I'll need to do one way before they have one out the door, and damned
if I could find it in me to envy myself the task. Oy. I wish I could say,
"DTD only!!!" That would make things much simpler. Unfortunately, most
have fallen for Schema. :-/ Bleh.
I'm kind of thinking that John's comment about this being one year out isn't
actually looking at the complexity from the -parsing- side. It's basically
like rewriting an interpreter, rewriting a new edit system if you're going
to do facet validation, a huge amount of relational structure work...it's
just nasty. I think I would actually be amazed if either party gets one
out within -only- one year, fully functional.
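For a sense of what facet validation means in practice, here is a small sketch covering four of the Schema facets (minLength, maxLength, pattern, enumeration) for string values. The function name check_facets and the dict layout are assumptions for illustration. Note that Schema patterns use their own regular-expression dialect; Python's re module is only an approximation of it.

```python
import re


def check_facets(value, facets):
    """Check a string value against a small subset of Schema-style facets."""
    errors = []
    if "minLength" in facets and len(value) < facets["minLength"]:
        errors.append("shorter than minLength")
    if "maxLength" in facets and len(value) > facets["maxLength"]:
        errors.append("longer than maxLength")
    # Schema patterns are implicitly anchored, hence fullmatch.
    if "pattern" in facets and re.fullmatch(facets["pattern"], value) is None:
        errors.append("does not match pattern")
    if "enumeration" in facets and value not in facets["enumeration"]:
        errors.append("not in enumeration")
    return errors
```

The real thing also has numeric range facets, whitespace handling, digit counts, and derivation of one restricted type from another, which is where the edit-system comparison comes from.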
No offense to John. Just...looking at Schema alone from the -opposite-
side of the equation, I think it would be a minor miracle to pull one
together that does -everything- properly inside that time frame, especially
if they're not focusing solely on that. Maybe a year is overstating it,
but I -know- my 180 hour estimate is way, WAY off...like not even in the
ballpark.
There are a lot of ways to shortcut the process or to mandate definitions
put in fP like I originally thought to do. That would be a huge timesaver.
But to do it -properly-? And -completely-? You basically have to teach
the system how to assemble such relationships on the fly from the data
itself--and then make it honour them. It's more or less writing a large
part of an interpreter.
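As a taste of "assembling the relationships on the fly", this sketch uses Python's expat bindings to read the ELEMENT declarations from a DTD internal subset and build a parent-to-children table. The function name allowed_children is hypothetical. It deliberately throws away ordering and cardinality (the +, ?, * and choice/sequence information), which is exactly the part that makes the full job hard, and it does not read external DTD files.

```python
import xml.parsers.expat


def allowed_children(xml_text):
    """Map each declared element name to the set of child names in its model."""
    table = {}

    def names(model):
        # expat hands over content models as nested tuples:
        # (type, quantifier, name, children)
        _type, _quant, name, children = model
        found = {name} if name else set()
        for child in children:
            found |= names(child)
        return found

    def on_element_decl(name, model):
        table[name] = names(model)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return table
```

Feeding it a document whose DOCTYPE declares order as (item+, note?) gives an entry mapping "order" to the names "item" and "note". Making the system actually honour those models, in order and with the right counts, is the interpreter-sized part.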
That year's looking more and more respectable a guess the more I look at
the actual details.
Okay, time for some Tylenol. :) My brain hurts.
mark->
--
Fairlight-> ||| "Loneliness is a power that we | Fairlight Consulting
__/\__ ||| possess to give and take away |
<__<>__> ||| forever." --Anderson/Yes | http://www.fairlite.com
\/ ||| | info at fairlite.com