Thursday, March 29, 2012
Filter for WordML vs Filter for XML vs Filter for Text
We currently are saving WordML (Word 2003's XML format rather than binary
format) docs into a column of our SQL Server 2005 database and full text
indexing on it. Works okay, but not great. For example, all the XML tags
are indexed... so you find many words that are not in the document (from the
user perspective).
The fix for that is clear... use an XML filter. (Though how to do that is
not so clear... we tried moving the WordML from a Text column to an XML
column, but it gives errors on illegal characters in the Word doc.)
However, it would seem that even an XML filter will not do nearly as well as
a filter designed for WordML.
Can anyone point me to a WordML full-text-search filter for SQL Server
2005?
Thanks,
Brian
Have you tried storing them as .doc files in varbinary(max) or image columns
and indexing them with the Word iFilter?
Hilary Cotter
"Brian" <TargetedConvergence@.newsgroup.nospam> wrote in message
news:%23bHGZ5idHHA.4616@.TK2MSFTNGP03.phx.gbl...
> We currently are saving WordML (Word 2003's XML format rather than binary
> format) docs into a column of our SQL Server 2005 database and full text
> indexing on it. Works okay, but not great. For example, all the XML tags
> are indexed... so you find many words that are not in the document (from
> the user perspective).
> The fix for that is clear... use an XML filter. (Though how to do that is
> not so clear... we tried moving the WordML from a Text column to an XML
> column, but it gives errors on illegal characters in the Word doc.)
> However, it would seem that even an XML filter will not do nearly as well
> as a filter designed for WordML.
> Can anyone point me to a WordML full-text-search filter for SQL Server
> 2005?
> Thanks,
> Brian
>
|||"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
news:u2u%23lFjdHHA.4656@.TK2MSFTNGP06.phx.gbl...
> Have you tried storing them as .doc files in varbinary(max) or image columns
> and indexing them with the Word iFilter?
No, for our app we need to keep them as .xml files, not .doc files.
But we could put the Word .xml file in a varbinary column if there is a
WordML iFilter out there that would act on the .xml file.
Thanks,
Brian
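
For readers following the varbinary idea, a minimal sketch of that setup might look like the following. The table, column, and catalog names are hypothetical and do not come from the thread. The assumption here is that full-text search picks the filter from the extension stored in the type column, so a row tagged '.xml' would go through whatever XML filter is registered on the server rather than a Word-specific one; whether that filter handles WordML well is exactly the open question above.

```sql
-- Hypothetical table: the raw file bytes plus a column holding the file extension.
CREATE TABLE dbo.Documents (
    DocId      int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Documents PRIMARY KEY,
    DocType    nvarchar(10)   NOT NULL,  -- e.g. N'.xml' or N'.doc'
    DocContent varbinary(max) NOT NULL
);

-- Hypothetical catalog name.
CREATE FULLTEXT CATALOG DocCatalog;

-- TYPE COLUMN tells full-text search which filter to load for each row.
CREATE FULLTEXT INDEX ON dbo.Documents (DocContent TYPE COLUMN DocType)
    KEY INDEX PK_Documents ON DocCatalog;

-- List the extensions that currently have a filter registered on this instance.
SELECT document_type, path
FROM sys.fulltext_document_types
ORDER BY document_type;
```

If '.xml' appears in the last result set, the files can stay as .xml and still be filtered; a dedicated WordML filter would only show up there if one has been installed and loaded.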
|||Hi Brian,
Have you tried storing your WordML files in a column of type XML? That
should invoke the XML filter.
Best regards,
-Denis.
"Brian" wrote:
> "Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
> news:u2u%23lFjdHHA.4656@.TK2MSFTNGP06.phx.gbl...
>
> No, for our app we need to keep them as .xml files, not .doc files.
> But we could put the Word .xml file in a varbinary column if there is a
> WordML iFilter out there that would act on the .xml file.
> Thanks,
> Brian
>
>
|||"denistc" <denistc@.discussions.microsoft.com> wrote in message
news:64C9346C-881C-4FC5-928C-1C7751ACECA6@.microsoft.com...
> Have you tried storing your WordML files in a column of type XML? That
> should invoke the XML filter.
Yes, though we've had trouble with some WordML files being rejected.
We've narrowed the issue on that front... any WordML file that has "UTF-8"
in the docheader gets rejected... not sure why though. Also not sure why
Word is creating some files with that encoding. Simply removing the tag
from the docheader (without any other re-encoding) seems to work fine, oddly
enough.
We're a bit nervous to move to the XML column until we understand what
conditions might cause XML to reject a WordML doc coming out of Word...
because without that understanding, we can't be sure it won't happen to our
customers.
Thanks for the suggestion,
Brian
|||"Brian" <TargetedConvergence@.newsgroup.nospam> wrote in message
news:%23OqdVzFgHHA.2396@.TK2MSFTNGP04.phx.gbl...
> Yes, though we've had trouble with some WordML files being rejected.
> We've narrowed the issue on that front... any WordML file that has "UTF-8"
> in the docheader gets rejected... not sure why though. Also not sure why
> Word is creating some files with that encoding. Simply removing the tag
> from the docheader (without any other re-encoding) seems to work fine,
> oddly enough.
Haven't done much with WordML myself, but assuming the WordML file has
"UTF-8" in the docheader, I'd be interested to know if it actually is a
UTF-8 file? Or is it possible it is being saved with an incorrect BOM or
invalid (non-UTF-8-encoded) characters in the doc? One of those two would
be my first guess. If so, that might be a bug that needs to be reported to
MS. You might try opening that WordML file in a hex editor to verify what
is actually being stored.
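
As an alternative to a hex editor, the leading bytes can also be inspected from SQL Server itself. This is only a sketch: the file path is hypothetical, OPENROWSET(BULK ...) needs a literal path and bulk-load permissions, and the second query rests on the assumption that converting raw bytes (rather than a varchar, nvarchar, or text value) lets the XML parser choose the encoding from the BOM and the declaration in the file.

```sql
-- Hypothetical path: point this at a WordML file that the xml column rejects.
SELECT
    -- 0xEFBBBF at the start is a UTF-8 BOM; 0xFFFE is a UTF-16 little-endian BOM.
    CAST(SUBSTRING(f.BulkColumn, 1, 4) AS varbinary(4)) AS LeadingBytes,
    DATALENGTH(f.BulkColumn) AS FileSizeInBytes
FROM OPENROWSET(BULK 'C:\XML\sample_wordml.xml', SINGLE_BLOB) AS f;

-- Try parsing the same bytes as xml. A failure here points at the file contents;
-- success suggests the earlier rejection came from the string type used to load it.
SELECT CAST(f.BulkColumn AS xml) AS ParsedDoc
FROM OPENROWSET(BULK 'C:\XML\sample_wordml.xml', SINGLE_BLOB) AS f;
```

If the second query succeeds while inserting the same document from a Text column fails, a mismatch between the declared "UTF-8" and the column's actual character encoding would be a plausible explanation, though that is a guess and not something confirmed in this thread.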
Friday, February 24, 2012
File name with sp_xml_preparedocument - Not working ?
Is it legal to give a file name to sp_xml_preparedocument?
Can I do the following?
DECLARE @hdoc int
DECLARE @doc varchar(1000)
SET @doc = 'C:/XML/myXmlFile.xml'
-- Create an internal representation of the XML document.
EXEC sp_xml_preparedocument @hdoc OUTPUT, @doc
-- Remove the internal representation.
EXEC sp_xml_removedocument @hdoc
I know the XML is well formed because when I paste the file contents into @doc above, it works.
I tried doing it and I get the following error:
The XML parse error 0xc00ce556 occurred on line number 1, near the XML text "C:\XML_Processing\Test.xml".
Msg 6602, Level 16, State 2, Procedure sp_xml_preparedocument, Line 1
The error description is 'Invalid at the top level of the document.'.
Msg 6607, Level 16, State 3, Procedure sp_xml_removedocument, Line 1
sp_xml_removedocument: The value supplied for parameter number 1 is invalid.
The XML is:
<ROOT>
<Customer CustomerID="VINET" ContactName="Paul Henriot">
<Order CustomerID="VINET" EmployeeID="5" OrderDate="1996-07-04T00:00:00">
<OrderDetail OrderID="10248" ProductID="11" Quantity="12"/>
<OrderDetail OrderID="10248" ProductID="42" Quantity="10"/>
</Order>
</Customer>
<Customer CustomerID="LILAS" ContactName="Carlos Gonzlez">
<Order CustomerID="LILAS" EmployeeID="3" OrderDate="1996-08-16T00:00:00">
<OrderDetail OrderID="10283" ProductID="72" Quantity="3"/>
</Order>
</Customer>
</ROOT>
Hemanshu:
I think the argument to the sp_xml_preparedocument stored procedure is the actual XML text and NOT a file reference. That would explain the error: the path string itself is being parsed as XML. Please take a look at sp_xml_preparedocument in Books Online.
You might be able to do something like:
Code Snippet
select * from openrowset(BULK 'C:/XML/myXmlFile.xml', SINGLE_BLOB) as x
to fetch the data. Take a look at OPENROWSET in Books Online.
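
Putting the two pieces together, a minimal sketch could read the file into a variable first and then hand the contents to sp_xml_preparedocument. It assumes the file at the path from the question holds the ROOT/Customer document shown above, that the SQL Server service account can read that path, and that the column sizes chosen in the WITH clause are adequate; none of this has been verified against the poster's server.

```sql
DECLARE @hdoc int;
DECLARE @doc  xml;

-- Read the whole file as one binary value and convert it to xml.
-- OPENROWSET(BULK ...) requires a literal path and a correlation name.
SELECT @doc = CAST(f.BulkColumn AS xml)
FROM OPENROWSET(BULK 'C:/XML/myXmlFile.xml', SINGLE_BLOB) AS f;

-- Create an internal representation of the XML document.
EXEC sp_xml_preparedocument @hdoc OUTPUT, @doc;

-- Flag 1 requests attribute-centric mapping, matching the sample document.
SELECT CustomerID, ContactName
FROM OPENXML(@hdoc, '/ROOT/Customer', 1)
WITH (CustomerID varchar(10), ContactName varchar(40));

-- Remove the internal representation.
EXEC sp_xml_removedocument @hdoc;
```

With the sample document, the SELECT should return one row per Customer element. Because the path in OPENROWSET cannot be a variable, loading different files this way would need dynamic SQL.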