Thursday, March 29, 2012

Filter for WordML vs Filter for XML vs Filter for Text

We are currently saving WordML docs (Word 2003's XML format rather than its
binary format) into a column of our SQL Server 2005 database and full-text
indexing that column. It works okay, but not great. For example, all the XML
tags are indexed, so searches match many words that are not in the document
(from the user's perspective).
The fix for that seems clear: use an XML filter. (How to do that is not so
clear, though. We tried moving the WordML from a Text column to an XML
column, but it gives errors on illegal characters in the Word doc.)
However, it would seem that even an XML filter will not do nearly as well as
a filter designed for WordML.
Can anyone point me to a WordML full-text-search filter for SQL Server
2005?
Thanks,
Brian
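To make the gap concrete, here is a minimal Python sketch of the text a
WordML-aware filter would need to hand to the indexer: only the contents of
the w:t runs, grouped by w:p paragraph, with all markup dropped. This is not
an iFilter and does not come from this thread. The function name and the
sample document are made up for illustration, and the namespace is assumed to
be the Word 2003 WordprocessingML one.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordprocessingML (WordML) documents.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def wordml_visible_text(xml_bytes):
    """Return one line per w:p paragraph, built from its w:t text runs."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Formatting elements such as w:rPr carry no reader-visible text,
        # so only the w:t children are collected.
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


# Hypothetical, heavily trimmed WordML document used only for this example.
sample = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Quarterly </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>report</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

print(wordml_visible_text(sample))
```

A plain-text filter would index names like wordDocument and rPr along with
the words; the sketch keeps only what the user typed. It ignores document
properties and anything nested inside a paragraph (for example text boxes),
so treat it as an illustration rather than a complete extractor.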
Have you tried storing them as .doc files in varbinary(max) or image columns
and indexing them with the Word iFilter?
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
|||"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
news:u2u%23lFjdHHA.4656@.TK2MSFTNGP06.phx.gbl...
> Have you tried storing them as doc's in varbinary (max) or image columns
> and indexing them with the Word iFilter.
No, for our app we need to keep them as .xml files, not .doc files.
But we could put the Word .xml file in a varbinary column if there is a
WordML iFilter out there that would act on the .xml file.
Thanks,
Brian
|||Hi Brian,
Have you tried storing your WordML files in a column of type XML? That
should invoke the XML filter.
Best regards,
-Denis.
|||"denistc" <denistc@.discussions.microsoft.com> wrote in message
news:64C9346C-881C-4FC5-928C-1C7751ACECA6@.microsoft.com...
> Have you tried storing your WordML files in a column of type XML? That
> should invoke the XML filter.
Yes, though we've had trouble with some WordML files being rejected.
We've narrowed the issue on that front... any WordML file that has "UTF-8"
in the docheader gets rejected... not sure why though. Also not sure why
Word is creating some files with that encoding. Simply removing the tag
from the docheader (without any other re-encoding) seems to work fine, oddly
enough.
We're a bit nervous to move to the XML column until we understand what
conditions might cause XML to reject a WordML doc coming out of Word...
because without that understanding, we can't be sure it won't happen to our
customers.
Thanks for the suggestion,
Brian
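For reference, the workaround described above could be scripted roughly as
follows. This is a sketch only: it assumes the "docheader" is the XML
declaration on the first line and that the "tag" is its encoding="..." part.
The function name is made up, and the code does not explain why the XML
column rejects the files; it just reproduces the manual edit.

```python
import re

# Matches an XML declaration at the very start of the text and captures the
# parts before and after its encoding="..." pseudo-attribute.
_DECL_ENCODING = re.compile(
    r"""\A(<\?xml\b[^?]*?)\s+encoding\s*=\s*(?:"[^"]*"|'[^']*')([^?]*\?>)"""
)


def strip_declared_encoding(xml_text):
    """Return xml_text without the encoding part of its XML declaration.

    Text with no declaration, or no encoding in it, is returned unchanged.
    The bytes themselves are not re-encoded, matching the manual workaround.
    """
    return _DECL_ENCODING.sub(r"\1\2", xml_text, count=1)


# Hypothetical first line of a WordML file.
before = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><w:wordDocument/>'
print(strip_declared_encoding(before))
```

The pattern is anchored at the start of the string, so a leading BOM or
whitespace would have to be removed first for it to match.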
|||"Brian" <TargetedConvergence@.newsgroup.nospam> wrote in message
news:%23OqdVzFgHHA.2396@.TK2MSFTNGP04.phx.gbl...
> Yes, though we've had trouble with some WordML files being rejected.
> We've narrowed the issue on that front... any WordML file that has "UTF-8"
> in the docheader gets rejected... not sure why though. Also not sure why
> Word is creating some files with that encoding. Simply removing the tag
> from the docheader (without any other re-encoding) seems to work fine,
> oddly enough.
I haven't done much with WordML myself, but assuming the WordML file has
"UTF-8" in the docheader, I'd be interested to know whether it actually is a
UTF-8 file. Or is it possible it is being saved with an incorrect BOM or
with invalid (non-UTF-8-encoded) characters in the doc? One of those two
would be my first guess. If so, that might be a bug that needs to be reported
to MS. You might try opening that WordML file in a hex editor to verify what
is actually being stored.
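If a hex editor is not handy, the same two checks (which BOM, if any, and
whether the bytes really are valid UTF-8) can be sketched in a few lines of
Python. The function name and the sample bytes are invented for illustration;
in practice you would pass in the raw bytes of the WordML file as saved by
Word.

```python
import codecs


def inspect_encoding(data):
    """Report which BOM (if any) starts `data` and whether it decodes as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        bom = "UTF-8"
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        bom = "UTF-16"
    else:
        bom = None
    try:
        data.decode("utf-8")
        bad_offset = None
    except UnicodeDecodeError as err:
        bad_offset = err.start  # byte offset of the first invalid sequence
    return {
        "bom": bom,
        "valid_utf8": bad_offset is None,
        "first_bad_offset": bad_offset,
    }


# Hypothetical content: the same text saved as real UTF-8 and as Windows-1252.
text = '<?xml version="1.0" encoding="UTF-8"?><a>\u00e9</a>'
print(inspect_encoding(text.encode("utf-8")))
print(inspect_encoding(text.encode("cp1252")))
```

A file that declares UTF-8 but reports valid_utf8 as False, or that starts
with a UTF-16 BOM, would fit the guess above; first_bad_offset tells you
where to look in the hex editor.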
