Importing XML Generically: A Proposed Project
Fairlight
fairlite at fairlite.com
Tue Dec 18 11:45:28 PST 2007
On Tue, Dec 18, 2007 at 10:29:29AM -0800, Bill Campbell, the prominent pundit,
witticized:
> I'm hardly expert at ways of processing XML, mostly doing things per case
> where I have to.
More or less the same, here.
> The fundamental problem is that XML is essentially a hierarchical, object
> oriented structure while FilePro is similar to a relational database (that
> should get some flames going :-). One may well have to import multiple
> tables from a single XML file.
Well, if you think about the hierarchical nature of XML, it's basically
relational as well. It's both at once, really.
Given the ability to write multiple .conf files for the same XML, one could
actually derive both main-record and header-detail records from different
runs, creating separate CSV files from the same exact XML file. I'm not
seeing too much of a problem there. The only real issue is in the limitations
of the CSV format--namely that unless you're separating values within
doublequotes, there's no real way to accommodate an "unlimited" number of
relations in a line. Unless you mandate a specific number of elements in
your XSD (<item_1> through <item_5> and that's -ALL- you get, all of them
required, even if you have to write <item_5/> when nothing is present for
contents), the whole concept of one-to-many gets irritating very
quickly.
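Just to make that fixed-arity idea concrete, here's a minimal perl sketch
(the field names and the five-slot limit are invented for illustration, not
part of any real .conf format) that flattens a variable-length item list
into exactly five quoted CSV columns, padding with empties the way a
mandatory <item_5/> would:

    # Hypothetical sketch: force a one-to-many item list into a
    # fixed five-column CSV layout, padding missing trailing items.
    use strict;
    use warnings;

    my %record = (
        id    => '1001',
        items => [ 'widget', 'gadget', 'sprocket' ],   # variable length
    );

    my @items = @{ $record{items} };
    $#items = 4;    # resize to exactly five slots; extras would be truncated
    my @row = ( $record{id}, map { defined $_ ? $_ : '' } @items );
    # (real code would also escape any embedded doublequotes)
    print join( ',', map { qq("$_") } @row ), "\n";
    # => "1001","widget","gadget","sprocket","",""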
The CSV solution admittedly isn't ideal. Ideally, fP would be able
to parse XML internally and just plain import it. I'm aware of the
theoretical fPXML vaporware (it's just as much vaporware as mine--neither
is released, though we both have code written). I've also been told as
recently as today that estimates for their release were six months to a year out.
Historically, a 6-12 month release guesstimate really translates to 12-36
months. If I kept going at the rate I was going this morning, barring
unforeseen issues (like every model I have in mind proving untenable and
having to rethink the whole approach) and any significant interruptions, I
could be done with what I propose in 2-4 ***weeks***--and that's not solid
coding time, that's a timeframe in which I could develop, test, debug, and
document. But like I said, I want a decent assurance that the ROI will
be sufficient to justify it. What I've done so far can be reused
elsewhere when I need it. To go much farther, I'm going to need incentive.
Sure, an internal parser will have advantages--assuming it ever
materialises. fP-API was semi-officially announced something like three
years ago--it's still not out. From what I've read, it's apparently now
fP-Mobile unless I misread something, and I'm not even trying to track the
release date on it. I've given up even on the few release time-frames
they feel "confident" enough to give out. 5.6 was released over a year
after someone at the company told me its release was imminent--'nuff
said.
Point is...people "get" IMPORT. Give them a CSV, they can import their
data. If they get multiple values in a field, they can splice out the
individuals and populate detail records (so multiple conversions aren't
necessary, they're just another way to skin the cat). But parsing the XML
internally within processing is just a complete bitch, and I've done it
myself so I can say that with some relative authority. Basically, what
I propose would act as a really strong stopgap measure until fPXML is
actually available, if ever. Of course, it could remain the solution for as
long as it met someone's needs.
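To illustrate the splicing step--purely a sketch, with made-up delimiters
and key names--one multi-valued CSV field becomes one detail row per value,
keyed back to the parent record:

    # Hypothetical: explode one multi-valued field into detail rows.
    use strict;
    use warnings;

    my ( $key, $multi ) = ( '1001', 'widget;gadget;sprocket' );
    for my $value ( split /;/, $multi ) {
        print qq("$key","$value"\n);    # one detail record per value
    }

In practice that splice would happen in filePro processing after the
IMPORT; the loop just shows the shape of the transformation.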
> I generally dig into XML files by writing a python module with a class
> structure that models the XML. Each class then has a method to generate
> CSV/Tab delimited output as necessary.
Historically, I wrote perl XML::Parser handlers that had -very specific-
field tracking code in them. I'd write a new set of handlers for each
project, replacing the field-specific logic.
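The old style looked roughly like this (a from-memory sketch with invented
element names, not any actual project's handlers)--note how the element
names are baked right into the callbacks:

    # Field-specific handlers: each new project meant a rewrite.
    use strict;
    use warnings;
    use XML::Parser;

    my ( $in_name, $name ) = ( 0, '' );
    my $p = XML::Parser->new( Handlers => {
        Start => sub { $in_name = 1 if $_[1] eq 'name' },
        Char  => sub { $name .= $_[1] if $in_name },
        End   => sub {
            if    ( $_[1] eq 'name' )   { $in_name = 0 }
            elsif ( $_[1] eq 'record' ) { print qq("$name"\n); $name = '' }
        },
    } );
    $p->parse('<list><record><name>Fairlight</name></record></list>');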
I'm writing the current handlers within a different architecture, so that
they handle things based on global state information about where one is
within the hierarchy. The plan is to populate "record" information (i.e.,
each line of data) as I go, then kick it out when the "record separator"
element is closed. The two biggest issues facing me are finding the most
efficient method of internal storage as I go, and combining that "store as
you go" methodology with the "I want these fields in this particular order"
methodology, while maintaining good performance. I've given it a fair amount
of thought recently and think I have several workable solutions to both.
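A rough sketch of the generic shape I have in mind (the separator element,
the path keys, and the field order are all placeholders): track the current
position in the hierarchy as a path, stash character data keyed by that
path, and flush one CSV line in the configured field order whenever the
record-separator element closes:

    # Generic, state-driven handlers: no field names hardcoded.
    use strict;
    use warnings;
    use XML::Parser;

    my $record_sep  = 'record';                               # placeholder
    my @field_order = ( 'list/record/id', 'list/record/name' );
    my ( @path, %current );

    my $p = XML::Parser->new( Handlers => {
        Start => sub { push @path, $_[1] },
        # Char may fire several times per text node, hence the .=
        Char  => sub { $current{ join '/', @path } .= $_[1] },
        End   => sub {
            if ( $_[1] eq $record_sep ) {
                print join( ',', map {
                    '"' . ( defined $current{$_} ? $current{$_} : '' ) . '"'
                } @field_order ), "\n";
                %current = ();      # reset state for the next record
            }
            pop @path;
        },
    } );
    $p->parse('<list><record><id>1</id><name>Fairlight</name></record></list>');
    # => "1","Fairlight"

The field-ordering problem then reduces to that @field_order list, which is
where a per-project .conf file would come in.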
> I suspect that XSLT might be useful in breaking an XML file into these
> multiple imports.
UGH. The last thing I want to do is mess with XSLT. Just last night I
had someone inquire about any potential new versions of sablotron, as it
isn't working right (something about sorting). I don't even want to touch
it. I think XSLT has more going against it for a proposition like this than
for it. It's an interesting idea, I just disagree. I may be thinking about
this differently than you at a fundamental level: I see it as injecting
another technology into the stream rather than stripping as many layers
away as possible.
mark->