XML Import

Scott Walker ScottWalker at RAMSystemsCorp.com
Mon Apr 13 12:29:14 PDT 2009



-----Original Message-----
From: filepro-list-bounces+scottwalker=ramsystemscorp.com at lists.celestial.com
[mailto:filepro-list-bounces+scottwalker=ramsystemscorp.com at lists.celestial.com]
On Behalf Of Fairlight
Sent: Monday, April 13, 2009 1:57 PM
To: 'filePro Mailing List'
Subject: Re: XML Import


From inside the gravity well of a singularity, Scott Walker shouted:
> Mark,
> 
> Thanks for your thoughts.  I looked at your xml2csv and I can see how that
> would make things easier.  Like you said, that still leaves me with hand
> coding of the fp process to look at the csv file and walk through the
> structure and figure out what to do with each line of data.  Big job for
> one xml source.  Huge job for 20 different xml sources with different
> schema.  Maybe not practical to do.

It's not -too- bad if the information you actually need falls in specific
areas and you can prune whole areas.  Actually, pruning is kind of a
convenience, since you can technically just walk the data tree with
entry/exit hints and ignore the rest.  But pruning means fewer CPU cycles
spent walking the tree later, in any event.
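
For what it's worth, the entry/exit walk with pruning is only a handful of
lines in a language with a real XML parser.  A hypothetical Python sketch
(the "wanted" tag names here are made up purely for illustration):

import xml.etree.ElementTree as ET

WANTED = {"header", "item"}   # hypothetical areas you actually care about

def walk(path):
    stack = []
    for event, elem in ET.iterparse(path, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)             # entry hint
        else:
            if elem.tag in WANTED:
                handle(".".join(stack), elem)  # subtree is complete here
            stack.pop()                        # exit hint
            elem.clear()                       # prune what we've walked

def handle(dotted_path, elem):
    print(dotted_path, {c.tag: c.text for c in elem})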

See after your notes for a practical, real-world point of interest that may
shoot this in the foot.

> I don't have a firm idea of how this could be done in an automated or
> semi-automated way.  I'm just starting to think about it and get a model
> in my head.
> 
> Something like this:
> 
> Note:  My fields may be a real field number, a real field name, or an
> array (since there may be multiple instances of a tag, like when we are
> handling the line items on an order).
> 
> I would have a record like this for each XML source in my XML_Sources
> file:
> 
> My Fields       XML Tag (for this source)    One or Many
> =========       =========================    ===========
> Cust_Code       header.customer_id           one
> Order_Date      header.order_dt              one
> Part_Num[xx]    item.part_no                 many
> Desc[xx]        item.description             many
> Qty[xx]         item.quantity                many
> 
> 
> Then when this was run, I would end up with all the data from the xml
> source in my fields/arrays.  My processing would then have to manage
> getting this data to my real (permanent) files.
> 
> So when I had to deal with the next XML source, in THEORY, perhaps, I
> hope/dream, all I would have to do is create another record in my
> XML_Sources file, basically mapping the schema for that source to the
> fields/arrays in my processing.  At that point, there would be no
> additional coding necessary to get the data to my permanent files.
> 
> Anyhow, this is currently totally half baked and is probably ignoring all
> the real world problems that will be encountered.  Just a rough idea for
> now.
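
Before I pick it apart: mechanically, the table-driven extractor you're
describing is maybe a dozen lines in a language with a real XML parser.
A hypothetical Python sketch of your example table (tag paths converted
from dots to slashes for ElementTree; none of this is real code of mine):

import xml.etree.ElementTree as ET

# One mapping per XML source: my field -> (tag path, one/many).
# These entries just mirror the table above.
ORDER_MAP = {
    "Cust_Code":  ("header/customer_id", "one"),
    "Order_Date": ("header/order_dt",    "one"),
    "Part_Num":   ("item/part_no",       "many"),
    "Desc":       ("item/description",   "many"),
    "Qty":        ("item/quantity",      "many"),
}

def extract(xml_path, mapping):
    root = ET.parse(xml_path).getroot()
    out = {}
    for field, (tag, card) in mapping.items():
        if card == "one":
            out[field] = root.findtext(".//" + tag)
        else:                # "many" -> an array, like Part_Num[xx]
            out[field] = [n.text for n in root.findall(".//" + tag)]
    return out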

Oh, that falls down in so many ways.  :(

Consider a very simple multi-record response.

<responses>
     <response>
          <order_id>9087251AB</order_id>
          <shipping>
               <first_name>Scott</first_name>
               <last_name>Nelson</last_name>
          </shipping>
          <billing>
               <first_name>Mark</first_name>
               <last_name>Luljak</last_name>
          </billing>
     </response>
     <response>
          <order_id>829025CA2</order_id>
          <shipping>
               <first_name>John</first_name>
               <last_name>Esak</last_name>
          </shipping>
          <billing>
               <first_name>Nancy</first_name>
               <last_name>Palmquist</last_name>
          </billing>
     </response>
</responses>

Okay.  Take just that bit as a tiny real-world example of a multi-record
response set.

Assume you map first_name to a field, last_name to another field.
Suddenly, you have multiple last names per record.  Or do you?  What are
you defining as "a record"?  Each instance of Billing?  Each instance of
Shipping?  Each response?  A permutation of both, with header/detail,
requiring multi-pass recursion?

And you want to create records automatically, based on multiple formats,
where "record" could have vastly different meanings?
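
You can watch the collision happen by running the naive mapping against
the sample above (Python again, assuming the document is sitting in a
string named SAMPLE):

import xml.etree.ElementTree as ET

root = ET.fromstring(SAMPLE)

# Naive mapping: every <last_name> lands in one "Last_Name" array.
last_names = [n.text for n in root.findall(".//last_name")]
print(last_names)   # ['Nelson', 'Luljak', 'Esak', 'Palmquist']
# Four last names, two orders.  Which pairs make up "a record"?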

This is why xml2csv never formally (in a release sense--or even past
pre-pre-pre-pre-alpha proof-of-concept code) went the route of trying to
figure it out.  I started down that path, but it got so convoluted that it
was absolutely impossible to generically define "what is a record".  The
data correlations change so fluidly that the meaning of "record" changes
drastically.  You may be looking at sub-records of sub-records of records
and calling those "records" in fP in a header-detail relationship.  Or you
may be looking at a single response, where the whole response is a
record.

In a -really- bad case, you may be looking at something like PRIA, a
standard where multiple whole data structures are repeated in different
parts of the document, complete with identical information, nested 3-6
levels deep, and you're only looking at getting one particular subset of
data out of it.  Or different sets of data as different types of "records".

The real problem becomes (and I ran into this in v00.00.01 of xml2csv when
it had a whole different focus similar to yours--before I realised it was
just flat-out NOT POSSIBLE) what happens when you try to define "a record",
but it's really a subset of another kind of iterative record.  Now try
defining that abstractly so that it applies to multiple formats.  It wasn't
even possible to do reasonably for -one-.

Like I said, AI is basically needed for anything near the kind of black box
you're talking about.  I tried it with tools a lot more suited to the task
than fP has (libexpat, XML::Parser in perl), and it still was just not
possible in terms of conventional programming.  You've got randomly
permuted Schemas/DTDs, and you want it to intelligently define, "What is a
record?"  Just within the PRIA standard, ONE Schema, that wasn't even
technically viable, because the reality was that it was records of records
of records of records, maybe 5-6 deep, and depending on how you looked at
it, it could have been any of those permutations.  The real sticking point
is this:

1) If you define the record scope too narrowly, you miss correlation (and
indeed most likely data) outside of that record's scope.

2) If you define the record scope too widely, you miss the ability to
derive meaningful header/detail relationships and form sub-records of any
meaning without significant recursion.  That recursion is -not- something
you want to toss at filePro, or probably do at all.

This is why "walking the data tree" is really the only sane way to do it
reasonably.  And it's why you can only go so far with something like
xml2csv before you -must- roll up your sleeves and get your hands dirty
coding.
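
Hand-rolled, "walking the data tree" means -you- decide the record
boundary, per format.  Against the sample above, it might look like this
(still Python, still just a sketch, not anything shipped):

import xml.etree.ElementTree as ET

def records(xml_path):
    # The human decision: one record per <response>, with the shipping
    # and billing blocks flattened into it.  Nothing derives this; it
    # is a per-format judgment call.
    for resp in ET.parse(xml_path).getroot().iter("response"):
        rec = {"order_id": resp.findtext("order_id")}
        for block in ("shipping", "billing"):
            for child in resp.find(block):
                rec[block + "_" + child.tag] = child.text
        yield rec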

I spent months just looking at how to make it do what you want to do.  I
came to the conclusion it was simply not viable within any reasonable price
point you care to name.  AI people don't come cheap.

It's not like I haven't given the matter considerable thought.  I mulled
it over on and off for two years, then spent a couple of months thinking
about it more intensely before and during the start of coding.  If I were
still taking that approach, I wouldn't be done for another two decades, if
ever.  It would cost me more to learn the AI techniques necessary than I
could ever recoup at any affordable price point.  It would for most
people, in all likelihood.

Just remember:  "What is a record?"  Hold up 3-5 different Schemas and try
to define it in some meaningful way that crosses data definition
boundaries.  Let me know if you succeed.  I never did.

mark->



Mark,

Yeah I know this is an ugly problem.  That's why I was asking if anyone had
solved it.  I did not want to reinvent this wheel if someone had already got
it knocked.  I am going to give this a lot of thought and explore a few
avenues before proceeding.  Thanks for your valued input.

Regards,

Scott
_______________________________________________
Filepro-list mailing list
Filepro-list at lists.celestial.com
http://mailman.celestial.com/mailman/listinfo/filepro-list



