ADV: XML to CSV Converter

Thu Jan 3 13:31:09 PST 2008

On Thu, Jan 03, 2008 at 02:34:09PM -0500, after drawing runes in goat's blood,
Fairlight cast forth these immortal, mystical words:
> In the relative spatial/temporal region of
> Thu, Jan 03, 2008 at 11:55:58AM -0700, fp at casabellagallery.com achieved the spontaneous
> generation of the following:
> > Aren't these things freely available?  A quick google search
> > turned out dozens of options.  A7Soft xml2csv and csv2xml appear
> > to be very popular.  I even came across some that run on *NIX
> 
> Show me where it says A7Soft xml2csv auto-decodes Base64 embedded
> files.  Or changes the quote character (the screenshot at
> http://www.a7soft.com/xml2csv.html shows only an option to -include-
> quotes, but nothing to define what character is used), or changes the field
> separator (list research shows that "~" instead of "," is a common fix to
> many problems, for instance).

Actually, I just downloaded A7Soft's xml2csv.  I misspoke about its
inability to change comma, but quotes are all-or-nothing.
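
To make the options in question concrete, here is a minimal Python sketch,
assuming nothing about how either product is implemented.  The helper names
format_row and decode_embedded are hypothetical; the "~" separator is the one
mentioned above, and the single-quote quote character is just an example.

```python
import base64
import csv
import io

def format_row(fields, delimiter="~", quotechar="'"):
    """Format one record with a caller-chosen separator and quote character."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quotechar=quotechar,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()

def decode_embedded(b64_text):
    """Decode a Base64 payload, such as an embedded file, to raw bytes."""
    return base64.b64decode(b64_text)

# Only the field that contains the separator gets quoted.
print(format_row(["STATE", "OF WASHINGTON (DEPT OF)", "a~b"]), end="")
print(decode_embedded("aGVsbG8="))
```

With QUOTE_MINIMAL, quoting is applied only where a field would otherwise be
ambiguous, which is a middle ground between all and nothing.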

But it's worse than that.  A lot worse.  It doesn't actually let you just
get the whole file translated (never mind embedding docs).  You have to
request specific fields.  I requested _CountyName from a PRIA document I've
been using as a test document.  You know what it gave me?

_CountyName
King
King

That was literally the output.  Now, it said "King" twice because there are
-two- instances of _CountyName.  They're attributes to two different
elements: _PROPERTY_ADDRESS and _RETURN_TO_PARTY.

What in their output shows me that?  Nothing.  You can apparently assign
aliases to get the same name element or attribute from different parents,
but this is less than ideal.
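
For illustration, here is a minimal Python sketch of output that keeps the
parent visible.  It is not the code of either tool discussed here.  The DOC
wrapper and the fragment as a whole are made up; only the _PROPERTY_ADDRESS,
_RETURN_TO_PARTY and _CountyName names come from the test described above,
and find_attribute is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: two different parents carry the same attribute name.
SAMPLE = """<DOC>
  <_PROPERTY_ADDRESS _CountyName="King"/>
  <_RETURN_TO_PARTY _CountyName="King"/>
</DOC>"""

def find_attribute(xml_text, attr_name):
    """Return (element path, value) for every element carrying attr_name."""
    root = ET.fromstring(xml_text)
    hits = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        if attr_name in elem.attrib:
            hits.append((here, elem.attrib[attr_name]))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return hits

for path, value in find_attribute(SAMPLE, "_CountyName"):
    print(path, value)
```

Each "King" now arrives with the element it belongs to, with no aliases
needed to tell the two apart.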

Looking further at the examples in their xml2csv.txt (that came with it),
it's quite apparent this was meant to pull a few fields from pretty
simplistic XML.  Look at the PRIA XML some places are using and it becomes an
-entirely- different ballgame.

Okay, I just tried a PRIA document with multiple instances of _GRANTOR and
_GRANTEE.  Both have many of the same attributes.  I asked for:

_FirstName,_LastName,_UnparsedName

You know what I got?  I'll put it between stars:

*****
D:\>type tsi.csv
_FirstName,_LastName,_UnparsedName
CHRISTOPHER,THOMAS,
THOMAS,ROBERTSON,
STATE,OF WASHINGTON (DEPT OF),STATE OF WASHINGTON (DEPT OF)
,,
,,
,,
,,
,,
CHRISTOPHER,THOMAS,
STATE,OF WASHINGTON (DEPT OF),STATE OF WASHINGTON (DEPT OF)
,,
,,
,,
,,
,,
*****

That's because there are grantors, but not grantees, and there are two
submissions including full sets of each within the same document (this is a
batch XML file I'm working off of).  Some of the fields requested also fall
into other elements, which is why there are even more lines than there
should be.

You tell me how -that- is useful in any way, shape, or form.  Please do.
It maintains hardly any sane relationships between data.  There appears to
be no way to correlate asking for another field from another element in a
different part of the hierarchy at the same time, e.g.:

<root>
     <nest_1 attr_1="something">
     </nest_1>
     <child_1 attr_325="else"/>
</root>

You can't seem to correlate attribute 1 and attribute 325 with the syntax
they give you and the output that's generated.  Not easily, anyway.  You'd
have to specify aliases for each field, giving the full path to each field
you want.  Why, again, should I have to alias my fields when they already
have perfectly good names?
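
As a rough sketch of the alternative (an assumption for illustration, not a
description of either product's internals), the path to each attribute can be
derived from the document itself instead of being spelled out as an alias.
The sample is a well-formed version of the small example above, and
attribute_rows and to_csv are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE = """<root>
     <nest_1 attr_1="something">
     </nest_1>
     <child_1 attr_325="else"/>
</root>"""

def attribute_rows(xml_text):
    """Return one (path, attribute, value) row per attribute in the document."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(elem, path):
        here = path + "/" + elem.tag
        for name, value in elem.attrib.items():
            rows.append((here, name, value))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return rows

def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["path", "attribute", "value"])
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(attribute_rows(SAMPLE)), end="")
```

Every attribute in the file is emitted with its full path, so attr_1 and
attr_325 can be located without a single line of alias configuration.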

Now, consider that the PRIA spec I tested against (amongst others) is
layered 6+ elements deep and has a -boatload- of attributes for most
elements--many of which share common names like _FirstName.  Oh, and I was
told they could send up to 160 batch entries in one batch, just by the by.

Okay, big difference:  Mine lets you do something sane and -entirely- track
your way through the data relationships with 100% certainty.  Theirs, from
what testing I've done, does not.

I looked at the "request certain fields" approach at the start of design,
and I entirely abandoned that line of thought when I thought about multiple
elements in multiple hierarchies, iterated a dynamic range of times.  It's
just not suited to the task.  If I wanted to get the entire document
translated, with their solution I'd have to create what I'm guessing are
several hundred lines worth of configuration aliases for -one- file to be
parsed--and I'm still not sure it would give me the results I'd expect,
even if I thought having one huge line was a good idea.  Actually, from the
looks of it, it just gave me results I -didn't- expect, nor could anyone
use.

That was originally my plan as well.  One line per "record".  That falls
by the wayside very quickly when you look at real-life examples.  It
works -great- against very, very simple XML.  Get into the complexity
of the real-life stuff that people have to deal with and it falls apart
faster than a paper airplane in white water rapids.  CSV is -not-
suited to one-to-many relationships in hierarchical order, using a
one-line-per-record model that encompasses the whole record--what you'd
think of as one record.  Thought about that for 4 days, and it's just not
tenable.  It's precisely because you can have sub-records or super-records
that this model falls apart under any sort of stress.
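
One layout that does survive one-to-many nesting, sketched in Python purely
as an illustration (it is an assumption, not a description of my converter's
actual output format), is one row per element, with each row carrying its own
id and its parent's id.  The BATCH and SUBMISSION element names are invented;
_GRANTOR, _FirstName, _LastName and the names themselves are taken from the
test output above.  element_rows is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented batch: two submissions, each with its own set of grantors.
SAMPLE = """<BATCH>
  <SUBMISSION>
    <_GRANTOR _FirstName="CHRISTOPHER" _LastName="THOMAS"/>
    <_GRANTOR _FirstName="THOMAS" _LastName="ROBERTSON"/>
  </SUBMISSION>
  <SUBMISSION>
    <_GRANTOR _FirstName="CHRISTOPHER" _LastName="THOMAS"/>
  </SUBMISSION>
</BATCH>"""

def element_rows(xml_text):
    """Return one (row_id, parent_id, tag, attributes) row per element."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(elem, parent_id):
        row_id = len(rows) + 1          # ids follow document order
        rows.append((row_id, parent_id, elem.tag, dict(elem.attrib)))
        for child in elem:
            walk(child, row_id)

    walk(root, 0)                       # parent id 0 marks the root
    return rows

for row in element_rows(SAMPLE):
    print(row)
```

Each grantor row points back at the submission it came from, so any number
of grantors or grantees per submission, and any number of submissions per
batch, stays unambiguous.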

Go ahead and try to use the A7Soft against something like a PRIA submission
and get anything useful/sensible out of it.  This is apparently a standard
with government agencies, btw.  But yeah, great free parser they have
there.  For enthusiasts doing very simple XML, in my opinion.  It's nothing
I would recommend a client put into production.

Incidentally, your original comment was pretty absurd.  What if Chevy,
Oldsmobile, Chrysler, Toyota--and more recently Saturn and Kia--had said,
"Well isn't Ford already making cars?"

mark->