DotNet First Side: Querying SQL Server Using Full-Text Search

Querying SQL Server Using Full-Text Search

This section describes the different types of searches you can perform
using full-text search.
Querying varbinary(max) and xml Columns
Searching for Specific Word or Phrase (Simple Term)
Querying Multiple Columns
Performing Prefix Searches
Querying Linked Servers
Searching for the Inflectional Form of a Specific Word (Generation Term)
Integrating Full-Text Search and Transact-SQL Predicates
Searching for Words or Phrases Using Weighted Values (Weighted Term)
Comparing Full-Text Functions and Full-Text Predicates
Searching for Words or Phrases Close to Another Word or Phrase
(Proximity Term)

Querying Multiple Columns

Updated: 17 July 2006
In Microsoft SQL Server 2005, you can use the CONTAINS predicate to
query multiple columns by specifying a list of columns to search. The
columns must be from the same table.
The following query searches for the term Red in the Name and Color
fields of the Production.Product table.
Copy Code
Use AdventureWorks;
GO
SELECT Name, Color FROM Production.Product
WHERE CONTAINS((Name, Color), 'Red');

Performing Prefix Searches

You can use Full-Text Search to search for words or phrases with a
specified prefix.
When conducting a prefix search, all entries in the column that contain
text beginning with the specified prefix will be returned. For example,
to search for all rows that contain the prefix top-, as in topple,
topping, and top itself, the query looks like this:
Copy Code
USE AdventureWorks;
GO
SELECT Description, ProductDescriptionID
FROM Production.ProductDescription
WHERE CONTAINS (Description, ' "top*" ' );
GO
All text that matches the text specified before the asterisk (*) is
returned. If the text and asterisk are not delimited by double quotation
marks, as in CONTAINS (DESCRIPTION, 'top*'), full-text search does not
consider the asterisk to be a wildcard..
When the prefix term is a phrase, each token making up the phrase is
considered a separate prefix term. All rows that have words beginning
with the prefix terms will be returned. For example, the prefix term
"light bread*" will find rows with text of either "light breaded,"
"lightly breaded," or "light bread", but will not return "Lightly
toasted bread".

Querying Linked Servers

To issue full-text queries against linked servers, you must first create
a full-text index on the target tables and columns on the remote server.
Next you must add the remote server as a linked server. That being done,
you can use a four-part name in CONTAINS or FREETEXT full-text queries
to query the target tables and columns on the linked server.

Searching for the Inflectional Form of a Specific Word (Generation Term)

You can search for all the different tenses of a verb or both the
singular and plural forms of a noun. For example, the query shown in
this topic searches for any form of "foot" ("foot", "feet", and so on)
in the Comments column of the ProductReview table.
Note that Full-Text Search uses stemmers. Stemmers allow you to search
for the different tenses of a verb, or both the singular and plural
forms of a noun. For example, this query searches for any form of "foot"
("foot", "feet", and so on) in the Comments column of the ProductReview
table. For more information about stemmers, see Word Breakers and Stemmers.
Copy Code
USE AdventureWorks;
GO
SELECT Comments, ReviewerName
FROM Production.ProductReview
WHERE CONTAINS (Comments, 'FORMSOF(INFLECTIONAL, "foot")');
GO

Integrating Full-Text Search and Transact-SQL Predicates

The CONTAINS and FREETEXT predicates can be combined with any of the
other Transact-SQL predicates, such as LIKE and BETWEEN; they can also
be used in a subquery. This example searches for descriptions in which
the description ID is not equal to 5 and in which the description
contains the word "Aluminum" and the word "spindle."
Copy Code
USE AdventureWorks;
GO
SELECT Description
FROM Production.ProductDescription
WHERE ProductDescriptionID <> 5 AND
CONTAINS(Description, ' Aluminum AND spindle');
GO
The following query uses CONTAINS within a subquery. Using the
AdventureWorks database, the query obtains the comment value of all the
comments in the ProductReview table for a particular cycle.
Copy Code
USE AdventureWorks;
GO
INSERT INTO Production.ProductReview
(ProductID, ReviewerName, EmailAddress, Rating, Comments)
VALUES
(780, 'John Smith', 'john@fourthcoffee.com', 5,
'The Mountain-200 Silver from AdventureWorks Cycles meets and exceeds
expectations. I enjoyed the smooth ride down the roads of Redmond')

-- Given the full-text catalog for these tables is Adv_ft_ctlg,
-- with change_tracking on so that the full-text indexes are updated
automatically.
WAITFOR DELAY '00:00:30'
-- Wait 30 seconds to make sure that the full-text index gets updated.

SELECT r.Comments, p.Name
FROM Production.ProductReview r
JOIN Production.Product p
ON
r.ProductID = p.ProductID

AND r.ProductID = (SELECT ProductID
FROM Production.ProductReview
WHERE CONTAINS (Comments,
' AdventureWorks AND
Redmond AND
"Mountain-200 Silver" '))

Searching for Specific Word or Phrase (Simple Term)

You can use the CONTAINS predicate to search a table for a specific phrase.
For example, if you want to search the ProductReview table in the
AdventureWorks database to find all comments about a product with the
phrase "learning curve", you could use the CONTAINS predicate as follows.
Copy Code
USE AdventureWorks;
GO
SELECT Comments
FROM Production.ProductReview
WHERE CONTAINS(Comments, ' "learning curve" ');
GO
The CONTAINS predicate uses functional notation in which the first
parameter is the name of the column or list of columns being searched,
and the second parameter is a full-text search condition. The search
condition, in this case "learning curve", can be quite complex and can
be composed of one or more terms
Note:
The full-text search queries are case-insensitive. However, in Japanese,
there are multiple phonetic orthographies in which the concept of
orthographic normalization is akin to case insensitivity (for example,
kana = insensitivity). This type of orthographic normalization is not
supported.

Searching for Words or Phrases Using Weighted Values (Weighted Term)

You can search for words or phrases and specify a weighting value.
Weight, measured as a number from 0.0 through 1.0, indicates the degree
of importance for each word and phrase within a set of words and
phrases. A weight value of 0.0 is the lowest, and a weight value of 1.0
is the highest. For example, this query searches for all customer
addresses, using weight values, in which any text beginning with the
string "Bay" has either "Street" or "View". Microsoft SQL Server 2005
gives a higher rank to those rows with more of the words specified.
Copy Code
USE AdventureWorks
GO
SELECT AddressLine1, KEY_TBL.RANK
FROM Person.Address AS Address INNER JOIN
CONTAINSTABLE(Person.Address, AddressLine1, 'ISABOUT ("Bay*",
Street WEIGHT(0.9),
View WEIGHT(0.1)
) ' ) AS KEY_TBL
ON Address.AddressID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK
GO
A weighted term can be used in conjunction with any of the other four
types of terms, namely simple term, prefix term, generation term, and
proximity term.

Comparing Full-Text Functions and Full-Text Predicates

The CONTAINSTABLE and FREETEXTTABLE functions are used to specify
full-text queries that return relevance rankings for each row. These
functions are very similar but are used differently from the full-text
predicates, CONTAINS and FREETEXT.
Although both the full-text predicates and the full-text functions are
used for full-text queries, and the syntax used to specify the full-text
search condition is the same in both the predicates and the functions,
there are major differences in the way that these are used. The
following lists some important points of similarity and difference:
CONTAINS and FREETEXT both return a TRUE or FALSE value and are
specified in the WHERE or HAVING clauses of a SELECT statement.
CONTAINSTABLE and FREETEXTTABLE both return a table of zero, one, or
more rows, so they must always be specified in the FROM clause.
CONTAINS and FREETEXT can only be used to specify selection criteria,
which Microsoft SQL Server uses to determine the membership of the
result set.
CONTAINSTABLE and FREETEXTTABLE are also used to specify selection
criteria. The table returned has a column named KEY that contains
full-text key values. Each full-text registered table has a column whose
values are guaranteed to be unique. The values returned in the KEY
column of CONTAINSTABLE or FREETEXTTABLE are the unique values, from the
full-text registered table, of the rows that match the selection
criteria specified in the full-text search condition.
Furthermore, the table produced by CONTAINSTABLE and FREETEXTTABLE has a
column named RANK, which contains values from 0 through 1000. A lower
value indicates lower relevance. These values are used to rank the rows
returned according to how well they matched the selection criteria.
Note:
The rank value only indicates a relative order of relevance of the rows
in the result set. The actual value is of no importance and should not
be expected to be same each time the query is run. For more information
about ranking, see Understanding Ranking.
CONTAINS and FREETEXT queries do not return any rank values.
When running queries that use the CONTAINSTABLE and FREETEXTTABLE
functions qualifying rows returned must be explicitly joined with the
rows in the original SQL Server table.
The following example returns the description and category name of all
food categories for which the Description column contains the words
"sweet and savory" near either the word "sauces" or the word "candies."
All rows with a category name "Seafood" are disregarded. Only rows with
a rank value of 2 or higher are returned.
Note:
To run some of the examples in this topic, you will need to install the
Northwind database. For information on how to install the Northwind
database, see Downloading Northwind and pubs Sample Databases.
Copy Code
USE Northwind;
GO
SELECT FT_TBL.Description,
FT_TBL.CategoryName,
KEY_TBL.RANK
FROM Categories AS FT_TBL INNER JOIN
CONTAINSTABLE (Categories, Description,
'("sweet and savory" NEAR sauces) OR
("sweet and savory" NEAR candies)'
) AS KEY_TBL
ON FT_TBL.CategoryID = KEY_TBL.[KEY]
WHERE KEY_TBL.RANK > 2
AND FT_TBL.CategoryName <> 'Seafood'
ORDER BY KEY_TBL.RANK DESC;
GO
Comparison between CONTAINSTABLE and CONTAINS

The CONTAINSTABLE function and the CONTAINS predicate use similar search
conditions.
However, in CONTAINSTABLE you specify the table that is full-text
searched, the column (or all the columns) in the table to be searched,
and the search condition. There is an optional parameter that allows the
user to indicate that only the highest specified number of matches be
returned. For more information, see the "Limiting Result Sets" section
of this topic.
CONTAINSTABLE returns a table that includes a column named RANK. This
RANK column contains a value for each row that indicates how well a row
matched the selection criteria. The higher the rank value of a row the
more relevant is the row for the given full-text query.
Comparison between FREETEXTTABLE and FREETEXT

The following query extends a FREETEXTTABLE query to return the highest
ranked rows first and to add the ranking of each row to the select list.
To specify the query, you must know that CategoryID is the unique key
column for the Categories table.
Copy Code
USE Northwind;
GO
SELECT KEY_TBL.RANK, FT_TBL.Description
FROM Categories AS FT_TBL
INNER JOIN
FREETEXTTABLE(Categories, Description,
'How can I make my own beers and ales?') AS KEY_TBL
ON FT_TBL.CategoryID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK DESC;
GO
Here is an extension of the same query that only returns rows with a
rank value of 10 or greater:
Copy Code
USE Northwind;
GO
SELECT KEY_TBL.RANK, FT_TBL.Description
FROM Categories FT_TBL
INNER JOIN
FREETEXTTABLE (Categories, Description,
'How can I make my own beers and ales?') AS KEY_TBL
ON FT_TBL.CategoryID = KEY_TBL.[KEY]
WHERE KEY_TBL.RANK >= 10
ORDER BY KEY_TBL.RANK DESC;
GO
Identifying the Unique Key Column Name

When writing queries that use rowset-valued functions, it is necessary
to know the name of the unique key column. Each full-text enabled table
has the TableFulltextKeyColumn property that contains the column ID of
the column that has been selected for enforcing unique rows for the
table. This example shows how to obtain the name of the key column
programmatically.
Copy Code
USE AdventureWorks;
GO
DECLARE @key_column sysname
SET @key_column = Col_Name(Object_Id('Production.Document'),
ObjectProperty(Object_id('Production.Document'),
'TableFulltextKeyColumn')
)
SELECT @key_column AS 'Unique Key Column';
GO
Limiting Result Sets to Return the Most Relevant Results

In many full-text queries, the number of items matching the search
condition is very large. To prevent queries from returning too many
matches, use the optional argument, top_n_by_rank, in CONTAINSTABLE and
FREETEXTTABLE to specify the number of matches according to rank you
want returned.
Note:
Using the top_n_by_rank argument returns a subset of rows that satisfy
the full-text query. If top_n_by_rank is combined with other predicates,
the query could return fewer rows than the number of rows that actually
match all the predicates.
With this information, Microsoft SQL Server orders the matches by rank
and returns only up to the specified number. This choice can result in a
dramatic increase in performance. For example, a query that would
normally return 100,000 rows from a table of one million rows are
processed more quickly if only the top 100 rows are requested.
If you want only the top 3 matches returned on an earlier example using
CONTAINSTABLE, the query looks like the following:
Copy Code
USE Northwind;
GO
SELECT K.RANK, CompanyName, ContactName, Address
FROM Customers AS C
INNER JOIN
CONTAINSTABLE(Customers,Address, 'ISABOUT ("des*",
Rue WEIGHT(0.5),
Bouchers WEIGHT(0.9))', 3) AS K
ON C.CustomerID = K.[KEY];
GO
Here is the result set:
Copy Code
RANK CompanyName ContactName address
---- ------------ ----------- -------
123 Bon app' Laurence Lebihan 12, rue des Bouchers
65 Du monde entier Janine Labrune 67, rue des Cinquante Otages
15 France restauration Carine Schmitt 54, rue Royale
This example returns the description and category name of the top 10
food categories where the Description column contains the words "sweet
and savory" near either the word "sauces" or the word "candies."
Copy Code
SELECT FT_TBL.Description,
FT_TBL.CategoryName,
KEY_TBL.RANK
FROM Categories AS FT_TBL INNER JOIN
CONTAINSTABLE (Categories, Description,
'("sweet and savory" NEAR sauces) OR
("sweet and savory" NEAR candies)'
, 10
) AS KEY_TBL
ON FT_TBL.CategoryID = KEY_TBL.[KEY];
GO

Searching for Words or Phrases Close to Another Word or Phrase
(Proximity Term)

You can search for words or phrases in close proximity to another word
or phrase. In addition, you can specify two words or phrases in any
order and get the same result. This example searches for the word
"reflector" close to the word "bracket".
Copy Code
USE AdventureWorks;
GO
SELECT DocumentID, DocumentSummary, Document
FROM Production.Document AS DocTable INNER JOIN
CONTAINSTABLE(Production.Document, Document, '(Reflector NEAR Bracket)'
) AS KEY_TBL
ON DocTable.DocumentID = KEY_TBL.[KEY]
ORDER BY KEY_TBL.RANK;
GO
Notice that you can also reverse the terms in CONTAINSTABLE to get the
same result:
Copy Code
CONTAINSTABLE(Production.Document, Document, '(Bracket NEAR Reflector)'
) AS KEY_TBL
You can use the tilde character (~) in place of the NEAR keyword in the
earlier query, and get the same results:
Copy Code
CONTAINSTABLE(Production.Document, Document, '(Reflector ~ Bracket)' )
AS KEY_TBL
More than two words or phrases can be specified in the search
conditions. For example, it is possible to say:
Copy Code
CONTAINSTABLE(Production.Document, Document, '(Reflector ~ Bracket ~
Installation)' ) AS KEY_TBL
This means that "Reflector" should be in close proximity to "Bracket",
and "Bracket" should be in close proximity to "Installation".

Querying varbinary(max) and xml Columns

Updated: 17 July 2006
CONTAINS and FREETEXT predicates may be used to search indexed
varbinary(max) and image as well as xml columns.
Important:
The image data type will be removed in a future version of Microsoft SQL
Server. Avoid using this data type in new development work, and plan to
modify applications that currently use them. Use the varbinary(max) data
type instead.
Many document types can be stored in a single varbinary(max), or xml
column. Microsoft SQL Server 2005 supports certain document types and
provides a filter for these types. For a list of these document types,
query the sys.fulltext_document_types catalog view.
When a varbinary(max) or an xml column participates in a full-text
index, the full-text service looks at the extensions of the documents
contained in the varbinary(max) column and applies a corresponding
filter to interpret the binary data and extract the textual information
needed for full-text indexing and querying. For an xml column, the xml
filter is applied.
Once indexed, the varbinary(max) or xml column can be queried like any
other column in a table, using the predicates CONTAINS and FREETEXT
For information on how to insert file data into a varbinary(max) or
image column, see OPENROWSET (Transact-SQL).
CONTAINS (Transact-SQL)

Updated: 14 April 2006
Is a predicate used to search columns containing character-based data
types for precise or fuzzy (less precise) matches to single words and
phrases, the proximity of words within a certain distance of one
another, or weighted matches.
In SQL Server 2005, you can use four-part names in CONTAINS or FREETEXT
full-text predicates to execute queries against linked servers.
CONTAINS can search for:
A word or phrase.
The prefix of a word or phrase.
A word near another word.
A word inflectionally generated from another (for example, the word
drive is the inflectional stem of drives, drove, driving, and driven).
A word that is a synonym of another word using thesaurus (for example,
the word metal can have synonyms such as aluminum and steel).
Transact-SQL Syntax Conventions [

http://msdn2.microsoft.com/en-us/library/ms177563.aspx ]
Syntax

CONTAINS
( { column_name | (column_list) | * }
, '< contains_search_condition >'
[ , LANGUAGE language_term ]
)
< contains_search_condition > ::=
{ < simple_term >
| < prefix_term >
| < generation_term >
| < proximity_term >
| < weighted_term >
}
| { ( < contains_search_condition > )
[ { < AND > | < AND NOT > | < OR > } ]
< contains_search_condition > [ ...n ]
}
< simple_term > ::=
word | " phrase "
< prefix term > ::=
{ "word * " | "phrase *" }
< generation_term > ::=
FORMSOF ( { INFLECTIONAL | THESAURUS } , < simple_term > [ ,...n ] )
< proximity_term > ::=
{ < simple_term > | < prefix_term > }
{ { NEAR | ~ }
{ < simple_term > | < prefix_term > }
} [ ...n ]
< weighted_term > ::=
ISABOUT
( { {
< simple_term >
| < prefix_term >
| < generation_term >
| < proximity_term >
}
[ WEIGHT ( weight_value ) ]
} [ ,...n ]
)
< AND > ::=
{ AND | & }
< AND NOT > ::=
{ AND NOT | & !}
< OR > ::=
{ OR | | }
Arguments

column_name
Is the name of the column or columns included in the full-text index.
Columns of type char, varchar, nchar, nvarchar, text, ntext, image, xml,
and varbinary(max) are valid columns for full-text searching.
column_list
Indicates that several columns, separated by a comma, can be specified.
column_list must be enclosed in parentheses. Unless language_term is
specified, the language of all columns of column_list must be the same.
*
Specifies that all columns in the table registered for full-text
searching should be used to search for the given contains search
condition. The columns in the CONTAINS clause must come from a single
table. If more than one table is in the FROM clause, * must be qualified
by the table name. Unless language_term is specified, the language of
all columns of the table must be the same.
LANGUAGE language_term
Is the language in which the user is issuing the query. If the column
contains several documents stored as binary large objects (BLOBs), the
language used to index the actual document content is determined by the
locale identifier (LCID) of the document. Consequently, a single column
can be storing multiple languages. The LANGUAGE parameter allows users
to specify the language in which they are searching, thus increasing the
probability of a good match.
This parameter is optional and can be specified as a string, integer, or
hexadecimal value corresponding to the LCID of a language. If
language_term is specified, the language it represents will be applied
to all elements of the search condition. If no value is specified, the
column full-text language is used.
When specified as a string, language_term corresponds to the alias
column value in the syslanguages system table. The string must be
enclosed in single quotation marks, as in 'language_term'. When
specified as an integer, language_term is the actual LCID that
identifies the language. When specified as a hexadecimal value,
language_term is 0x followed by the hexadecimal value of the LCID. The
hexadecimal value must not exceed eight digits, including leading zeros.
If the value is in double-byte character set (DBCS) format, SQL Server
will convert it to Unicode.
If the language specified is not valid or there are no resources
installed that correspond to that language, SQL Server returns an error.
To use the neutral language resources, specify 0x0 as language_term.
<contains_search_condition>
Specifies the text to search for in column_name and the conditions for a
match.
contains_search_condition is nvarchar. An implicit conversion occurs
when another character data type is used as input. In the following
example, the @SearchWord variable, which is defined as varchar(30),
causes an implicit conversion in the CONTAINS predicate.
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord varchar(30)
SET @SearchWord ='performance'
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, @SearchWord);
Because "parameter sniffing" does not work across conversion, use
nvarchar for better performance. In the example, declare @SearchWord as
nvarchar(30).
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord nvarchar(30)
SET @SearchWord = N'performance'
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, @SearchWord);
You can also use the OPTIMIZE FOR query hint for cases in which a non
optimal plan is generated.
word
Is a string of characters without spaces or punctuation.
phrase
Is one or more words with spaces between each word.
Note:
Some languages, such as those written in some parts of Asia, can have
phrases that consist of one or more words without spaces between them.
<simple_term>
Specifies a match for an exact word or a phrase. Examples of valid
simple terms are "blue berry", blueberry, and "Microsoft SQL Server".
Phrases should be enclosed in double quotation marks (""). Words in a
phrase must appear in the same order as specified in
<contains_search_condition> as they appear in the database column. The
search for characters in the word or phrase is not case sensitive. Noise
words (such as a, and, or the) in full-text indexed columns are not
stored in the full-text index. If a noise word is used in a single word
search, SQL Server returns an error message indicating that the query
contains only noise words. SQL Server includes a standard list of noise
words in the directory \Mssql\Binn\FTERef of each instance of SQL Server.
Punctuation is ignored. Therefore, CONTAINS(testing, "computer failure")
matches a row with the value, "Where is my computer? Failure to find it
would be expensive." For more information on word-breaker behavior, see
Word Breakers and Stemmers [

http://msdn2.microsoft.com/en-us/library/ms142509.aspx ] .
<prefix_term>
Specifies a match of words or phrases beginning with the specified text.
Enclose a prefix term in double quotation marks ("") and add an asterisk
(*) before the ending quotation mark, so that all text starting with the
simple term specified before the asterisk is matched. The clause should
be specified this way: CONTAINS (column, '"text*"'). The asterisk
matches zero, one, or more characters (of the root word or words in the
word or phrase). If the text and asterisk are not delimited by double
quotation marks, so the predicate reads CONTAINS (column, 'text*'),
full-text search considers the asterisk as a character and searches for
exact matches to text*. The full-text engine will not find words with
the asterisk (*) character because word breakers typically ignore such
characters.
When <prefix_term> is a phrase, each word contained in the phrase is
considered to be a separate prefix. Therefore, a query specifying a
prefix term of "local wine*" matches any rows with the text of "local
winery", "locally wined and dined", and so on.
<generation_term>
Specifies a match of words when the included simple terms include
variants of the original word for which to search.
INFLECTIONAL
Specifies that the language-dependent stemmer is to be used on the
specified simple term. Stemmer behavior is defined based on stemming
rules of each specific language. The neutral language does not have an
associated stemmer. The column language of the column(s) being queried
is used to refer to the desired stemmer. If language_term is specified,
the stemmer corresponding to that language is used.
A given <simple_term> within a <generation_term> will not match both
nouns and verbs.
THESAURUS
Specifies that the thesaurus corresponding to the column full-text
language, or the language specified in the query is used. The longest
pattern or patterns from the simple_term are matched against the
thesaurus and additional terms are generated to expand or replace the
original pattern. If a match is not found for all or part of the
simple_term, the non-matching portion is treated as a simple_term. For
more information on the Full-Text Search thesaurus, see Thesaurus [

http://msdn2.microsoft.com/en-us/library/ms142491.aspx ] .
<proximity_term>
Specifies a match of words or phrases that must be close to one another.
<proximity_term> operates similarly to the AND operator: both require
that more than one word or phrase exist in the column being searched. As
the words in <proximity_term> appear closer together, the better the match.
NEAR | ~
Indicates that the word or phrase on the left side of the NEAR or ~
operator should be approximately close to the word or phrase on the
right side of the NEAR or ~ operator. Multiple proximity terms can be
chained, for example:
Copy Code
a NEAR b NEAR c
This means that word or phrase a should be near word or phrase b, which
should be near word or phrase c. When several proximity terms are
chained, all proximity terms must be near each other. So, in the example
a ~ b ~ c, word or phrase a should also be near word or phrase c.
<weighted_term>
Specifies that the matching rows (returned by the query) match a list of
words and phrases, each optionally given a weighting value.
ISABOUT
Specifies the <weighted_term> keyword.
WEIGHT(weight_value)
Specifies a weight value, which is a number from 0.0 through 1.0. Each
component in <weighted_term> may include a weight_value. weight_value is
a way to change how various portions of a query affect the rank value
assigned to each row matching the query. WEIGHT does not affect the
results of CONTAINS queries, but WEIGHT impacts rank in CONTAINSTABLE
queries. For more information, see CONTAINSTABLE (Transact-SQL) [

http://msdn2.microsoft.com/en-us/library/ms189760.aspx ] .
Note:
The decimal separator depends on the operating-system locale. Therefore,
the decimal separator can be a character other than the period ".", for
example it can be a comma ",".
{ AND | & } | { AND NOT | &! } | { OR | | }
Specifies a logical operation between two contains search conditions.
AND | &
Indicates that the two contains search conditions must be met for a
match. The ampersand symbol (&) may be used instead of the AND keyword
to represent the AND operator.
AND NOT | &!
Indicates that the second search condition must not be present for a
match. The ampersand followed by the exclamation mark symbol (&!) may be
used instead of the AND NOT keyword to represent the AND NOT operator.
OR | |
Indicates that either of the two contains search conditions must be met
for a match. The bar symbol (|) may be used instead of the OR keyword to
represent the OR operator.
When <contains_search_condition> contains parenthesized groups, these
parenthesized groups are evaluated first. After evaluating parenthesized
groups, these rules apply when using these logical operators with
contains search conditions:
NOT is applied before AND.
NOT can only occur after AND, as in AND NOT. The OR NOT operator is not
allowed. NOT cannot be specified before the first term. For example,
CONTAINS (mycolumn, 'NOT "phrase_to_search_for" ' ) is not valid.
AND is applied before OR.
Boolean operators of the same type (AND, OR) are associative and can
therefore be applied in any order.
n
Is a placeholder indicating that multiple contains search conditions and
terms within them can be specified.
Remarks

CONTAINS is not recognized as a keyword if the compatibility level is
less than 70. For more information, see sp_dbcmptlevel (Transact-SQL) [

http://msdn2.microsoft.com/en-us/library/ms178653.aspx ] .
Examples

A. Using CONTAINS with <simple_term>
The following example finds all products with a price of $80.99 that
contain the word "Mountain".
Copy Code
USE AdventureWorks;
GO
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain');
GO
B. Using CONTAINS and phrase in <simple_term>
The following example returns all products that contain either the
phrase "Mountain" or "Road".
Copy Code
USE AdventureWorks;
GO
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, ' "Mountain" OR "Road" ')
GO
C. Using CONTAINS with <prefix_term>
The following example returns all product names with at least one word
starting with the prefix chain in the Name column.
Copy Code
USE AdventureWorks;
GO
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, ' "Chain*" ');
GO
D. Using CONTAINS and OR with <prefix_term>
The following example returns all category descriptions containing
strings with prefixes of either "chain" or "full".
Copy Code
USE AdventureWorks;
GO
SELECT Name
FROM Production.Product
WHERE CONTAINS(Name, '"chain*" OR "full*"');
GO
E. Using CONTAINS with <proximity_term>
The following example returns all product names that have the word bike
near the word performance.
Copy Code
USE AdventureWorks;
GO
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'bike NEAR performance');
GO
F. Using CONTAINS with <generation_term>
The following example searches for all products with words of the form
ride: riding, ridden, and so on.
Copy Code
USE AdventureWorks;
GO
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, ' FORMSOF (INFLECTIONAL, ride) ');
GO
G. Using CONTAINS with <weighted_term>
The following example searches for all product names containing the
words performance, comfortable, or smooth, and different weightings are
given to each word.
Copy Code
USE AdventureWorks;
GO
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'ISABOUT (performance weight (.8),
comfortable weight (.4), smooth weight (.2) )' );
GO
H. Using CONTAINS with variables
The following example uses a variable instead of a specific search term.
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord nvarchar(30)
SET @SearchWord = N'Performance'
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, @SearchWord);
GO
FREETEXT (Transact-SQL)

Updated: 14 April 2006
Is a predicate used to search columns containing character-based data
types for values that match the meaning and not the exact wording of the
words in the search condition. When FREETEXT is used, the full-text
query engine internally performs the following actions on the
freetext_string, assigns each term a weight, and then finds the matches.
Separates the string into individual words based on word boundaries
(word-breaking).
Generates inflectional forms of the words (stemming).
Identifies a list of expansions or replacements for the terms based on
matches in the thesaurus.
Transact-SQL Syntax Conventions [

http://msdn2.microsoft.com/en-us/library/ms177563.aspx ]
Syntax

FREETEXT ( { column_name | (column_list) | * }
, 'freetext_string' [ , LANGUAGE language_term ] )
Arguments

column_name
Is the name of the column that has been registered for full-text
searching. Columns of type char, varchar, nchar, nvarchar, text, ntext,
image, xml, and varbinary(max) are valid columns for full-text searching.
column_list
Indicates that several columns, separated by a comma, can be specified.
column_list must be enclosed in parentheses. Unless language_term is
specified, the language of all columns of column_list must be the same.
*
Specifies that all columns that have been registered for full-text
searching should be used to search for the given freetext_string. If
more than one table is in the FROM clause, * must be qualified by the
table name. Unless language_term is specified, the language of all
columns of the table must be the same.
freetext_string
Is text to search for in the column_name. Any text, including words,
phrases or sentences, can be entered. Matches are generated if any term
or the forms of any term is found in the full-text index.
freetext_string is nvarchar. An implicit conversion occurs when another
character data type is used as input. In the following example, the
@SearchWord variable, which is defined as varchar(30), causes an
implicit conversion in the FREETEXT predicate.
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord varchar(30)
SET @SearchWord ='performance'
SELECT Description
FROM Production.ProductDescription
WHERE FREETEXT(Description, @SearchWord);
Because "parameter sniffing" does not work across conversion, use
nvarchar for better performance. In the example, declare @SearchWord as
nvarchar(30).
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord nvarchar(30)
SET @SearchWord = N'performance'
SELECT Description
FROM Production.ProductDescription
WHERE FREETEXT(Description, @SearchWord);
You can also use the OPTIMIZE FOR query hint for cases in which a
nonoptimal plan is generated.
Unlike in the CONTAINS search condition where AND is a keyword, when
used in freetext_string the word 'and' is considered a noise word and
will be discarded.
Use of WEIGHT, FORMSOF, wildcards, NEAR and other syntax is not allowed.
freetext_string is wordbroken, stemmed, and passed through the
thesaurus. If freetext_string is enclosed in double quotation marks, a
phrase match is instead performed; stemming and thesaurus are not
performed.
LANGUAGE language_term
Is the language whose resources will be used for wordbreaking, stemming,
and thesaurus and noise-word removal as part of the FREETEXT query. This
parameter is optional and can be specified as a string, integer, or
hexadecimal value corresponding to the locale identifier (LCID) of a
language. If language_term is specified, the language it represents will
be applied to all elements of the search condition. If no value is
specified, the column full-text language is used.
When specified as a string, language_term corresponds to the alias
column value in the syslanguages system table. The string must be
enclosed in single quotation marks, as in 'language_term'. When
specified as an integer, language_term is the actual LCID that
identifies the language. When specified as a hexadecimal value,
language_term is 0x followed by the hexadecimal value of the LCID. The
hexadecimal value must not exceed eight digits, including leading zeros.
If the value is in double-byte character set (DBCS) format, Microsoft
SQL Server will convert it to Unicode.
If the language specified is not valid or there are no resources
installed that correspond to that language, Microsoft SQL Server returns
an error. To use the neutral language resources, specify 0x0 as
language_term.
Remarks

Full-text queries using FREETEXT are less precise than those full-text
queries using CONTAINS. The SQL Server full-text search engine
identifies important words and phrases. No special meaning is given to
any of the reserved keywords or wildcard characters that typically have
meaning when specified in the <contains_search_condition> parameter of
the CONTAINS predicate.
FREETEXT is not recognized as a keyword if the compatibility level is
less than 70. For more information, see sp_dbcmptlevel (Transact-SQL) [

http://msdn2.microsoft.com/en-us/library/ms178653.aspx ] .
Examples

A. Using FREETEXT to search for words containing specified character values
The following example searches for all documents containing the words
related to vital, safety, components.
Copy Code
USE AdventureWorks;
GO
SELECT Title
FROM Production.Document
WHERE FREETEXT (Document, 'vital safety components' );
GO
B. Using FREETEXT with variables
The following example uses a variable instead of a specific search term.
Copy Code
USE AdventureWorks;
GO
DECLARE @SearchWord nvarchar(30);
SET @SearchWord = N'high-performance';
SELECT Description
FROM Production.ProductDescription
WHERE FREETEXT(Description, @SearchWord);
GO
sys.fulltext_document_types

Returns a row for each document type that is available for full-text
indexing operations. Each row represents the IFilter interface that is
registered in the instance of Microsoft SQL Server 2005.
Column name
Data type
Description
document_type
sysname
The file extension of the supported document type.
This value can be used to identify the filter that will be used during
full-text indexing of columns of type varbinary(max) or image.
class_id
uniqueidentifier
GUID of the IFilter class that supports file extension.
path
nvarchar(260)
The path to the IFilter DLL. The path is only visible to members of the
serveradmin fixed server role.
version
sysname
Version of the IFilter DLL.
manufacturer
sysname
Name of the IFilter manufacturer.
OPENROWSET (Transact-SQL)

Updated: 1 February 2007
Includes all connection information that is required to access remote
data from an OLE DB data source. This method is an alternative to
accessing tables in a linked server and is a one-time, ad hoc method of
connecting and accessing remote data by using OLE DB. For more frequent
references to OLE DB data sources, use linked servers instead. For more
information, see Linking Servers [

http://msdn2.microsoft.com/en-us/library/ms188279.aspx ] . The
OPENROWSET function can be referenced in the FROM clause of a query as
if it were a table name. The OPENROWSET function can also be referenced
as the target table of an INSERT, UPDATE, or DELETE statement, subject
to the capabilities of the OLE DB provider. Although the query might
return multiple result sets, OPENROWSET returns only the first one.
OPENROWSET also supports bulk operations through a built-in BULK
provider that enables data from a file to be read and returned as a rowset.
Transact-SQL Syntax Conventions [

http://msdn2.microsoft.com/en-us/library/ms177563.aspx ]
Syntax

OPENROWSET
( { 'provider_name' , { 'datasource' ; 'user_id' ; 'password'
| 'provider_string' }
, { [ catalog. ] [ schema. ] object
| 'query'
}
| BULK 'data_file' ,
{ FORMATFILE = 'format_file_path' [ <bulk_options> ]
| SINGLE_BLOB | SINGLE_CLOB | SINGLE_NCLOB }
} )

<bulk_options> ::=
[ , CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
[ , ERRORFILE = 'file_name' ]
[ , FIRSTROW = first_row ]
[ , LASTROW = last_row ]
[ , MAXERRORS = maximum_errors ]
[ , ROWS_PER_BATCH = rows_per_batch ]
Arguments

' provider_name '
Is a character string that represents the friendly name (or PROGID) of
the OLE DB provider as specified in the registry. provider_name has no
default value.
' datasource '
Is a string constant that corresponds to a particular OLE DB data
source. datasource is the DBPROP_INIT_DATASOURCE property to be passed
to the IDBProperties interface of the provider to initialize the
provider. Typically, this string includes the name of the database file,
the name of a database server, or a name that the provider understands
to locate the database or databases.
' user_id '
Is a string constant that is the user name passed to the specified OLE
DB provider. user_id specifies the security context for the connection
and is passed in as the DBPROP_AUTH_USERID property to initialize the
provider. user_id cannot be a Microsoft Windows login name.
' password '
Is a string constant that is the user password to be passed to the OLE
DB provider. password is passed in as the DBPROP_AUTH_PASSWORD property
when initializing the provider. password cannot be a Microsoft Windows
password.
' provider_string '
Is a provider-specific connection string that is passed in as the
DBPROP_INIT_PROVIDERSTRING property to initialize the OLE DB provider.
provider_string typically encapsulates all the connection information
required to initialize the provider. For a list of keywords that are
recognized by the SQL Native Client OLE DB provider, see Initialization
and Authorization Properties [

http://msdn2.microsoft.com/en-us/library/ms131377.aspx ] .
catalog
Is the name of the catalog or database in which the specified object
resides.
schema
Is the name of the schema or object owner for the specified object.
object
Is the object name that uniquely identifies the object to work with.
' query '
Is a string constant sent to and executed by the provider. The local
instance of SQL Server does not process this query, but processes query
results returned by the provider, a pass-through query. Pass-through
queries are useful when used on providers that do not make available
their tabular data through table names, but only through a command
language. Pass-through queries are supported on the remote server, as
long as the query provider supports the OLE DB Command object and its
mandatory interfaces. For more information, see SQL Native Client (OLE
DB) Reference [ http://msdn2.microsoft.com/en-us/library/ms131265.aspx ] .
BULK
Uses the BULK rowset provider for OPENROWSET to read data from a file.
In SQL Server 2005, OPENROWSET can read from a data file without loading
the data into a target table. This lets you use OPENROWSET with a simple
SELECT statement.
The arguments of the BULK option allow for significant control over
where to start and end reading data, how to deal with errors, and how
data is interpreted. For example, you can specify that the data file be
read as a single-row, single-column rowset of type varbinary, varchar,
or nvarchar. The default behavior is described in the argument
descriptions that follow.
For information on how to use the BULK option, see "Remarks," later in
this topic. For information about the permissions that are required by
the BULK option, see "Permissions," later in this topic.
Note:
When used to import data with the full recovery model, OPENROWSET (BULK
...) does not optimize logging.
For information on preparing data for bulk import, see Preparing Data
for Bulk Export or Import [

http://msdn2.microsoft.com/en-us/library/ms188609.aspx ] .
' data_file '
Is the full path of the data file whose data is to be copied into the
target table.
FORMATFILE = 'format_file_path'
Specifies the full path of a format file. SQL Server 2005 supports two
types of format files: XML and non-XML.
A format file is required to define column types in the result set. The
only exception is when SINGLE_CLOB, SINGLE_BLOB, or SINGLE_NCLOB is
specified; in which case, the format file is not required.
For information about format files, see Using a Format File to Bulk
Import Data [ http://msdn2.microsoft.com/en-us/library/ms178129.aspx ] .
< bulk_options >
Specifies one or more arguments for the BULK option.
CODEPAGE = { 'ACP '| 'OEM '| 'RAW '| 'code_page' }
Specifies the code page of the data in the data file. CODEPAGE is
relevant only if the data contains char, varchar, or text columns with
character values more than 127 or less than 32.
CODEPAGE value
Description
ACP
Converts columns of char, varchar, or text data type from the
ANSI/Microsoft Windows code page (ISO 1252) to the SQL Server code page.
OEM (default)
Converts columns of char, varchar, or text data type from the system OEM
code page to the SQL Server code page.
RAW
No conversion occurs from one code page to another. This is the fastest
option.
code_page
Indicates the source code page on which the character data in the data
file is encoded; for example, 850. The code page is required for the SQL
Server 2005 Database Engine to correctly interpret the input data.
ERRORFILE = 'file_name'
Specifies the file used to collect rows that have formatting errors and
cannot be converted to an OLE DB rowset. These rows are copied into this
error file from the data file "as is."
The error file is created at the start of the command execution. An
error will be raised if the file already exists. Additionally, a control
file that has the extension .ERROR.txt is created. This file references
each row in the error file and provides error diagnostics. After the
errors have been corrected, the data can be loaded.
FIRSTROW = first_row
Specifies the number of the first row to load. The default is 1. This
indicates the first row in the specified data file. The row numbers are
determined by counting the row terminators.
LASTROW = last_row
Specifies the number of the last row to load. The default is 0. This
indicates the last row in the specified data file.
MAXERRORS = maximum_errors
Specifies the maximum number of syntax errors or nonconforming rows, as
defined in the format file, that can occur before OPENROWSET throws an
exception. Until MAXERRORS is reached, OPENROWSET ignores each bad row,
not loading it, and counts the bad row as one error.
The default for maximum_errors is 10.
Note:
MAX_ERRORS does not apply to CHECK constraints, or to converting money
and bigint data types.
ROWS_PER_BATCH = rows_per_batch
Specifies the approximate number of rows of data in the data file. This
value should be of the same order as the actual number of rows.
OPENROWSET always imports a data file as a single batch. However, if you
specify rows_per_batch with a value > 0, the query processor uses the
value of rows_per_batch as a hint for allocating resources in the query
plan.
By default, ROWS_PER_BATCH is unknown. Specifying ROWS_PER_BATCH = 0 is
the same as omitting ROWS_PER_BATCH.
SINGLE_BLOB
Returns the contents of data_file as a single-row, single-column rowset
of type varbinary(max).
Important:
We recommend that you import XML data only using the SINGLE_BLOB option,
rather than SINGLE_CLOB and SINGLE_NCLOB, because only SINGLE_BLOB
supports all Windows encoding conversions.
SINGLE_CLOB
By reading data_file as ASCII, returns the contents as a single-row,
single-column rowset of type varchar(max), using the collation of the
current database.
SINGLE_NCLOB
By reading data_file as UNICODE, returns the contents as a single-row,
single-column rowset of type nvarchar(max), using the collation of the
current database.
Remarks

OPENROWSET can be used to access remote data from OLE DB data sources
only when the DisallowAdhocAccess registry option is explicitly set to 0
for the specified provider, and the Ad Hoc Distributed Queries advanced
configuration option is enabled. When these options are not set, the
default behavior does not allow for ad hoc access.
When accessing remote OLE DB data sources, the login identity of trusted
connections is not automatically delegated from the server on which the
client is connected to the server that is being queried. Authentication
delegation must be configured. For more information, see Configuring
Linked Servers for Delegation [

http://msdn2.microsoft.com/en-us/library/ms189580.aspx ] .
Catalog and schema names are required if the OLE DB provider supports
multiple catalogs and schemas in the specified data source. Values for
catalog and schema can be omitted when the OLE DB provider does not
support them. If the provider supports only schema names, a two-part
name of the form schema.object must be specified. If the provider
supports only catalog names, a three-part name of the form
catalog.schema.object must be specified. Three-part names must be
specified for pass-through queries that use the SQL Native Client OLE DB
provider. For more information, see Transact-SQL Syntax Conventions
(Transact-SQL) [ http://msdn2.microsoft.com/en-us/library/ms177563.aspx ] .
OPENROWSET does not accept variables for its arguments.
Using OPENROWSET with the BULK Option
The following Transact-SQL enhancements support the OPENROWSET(BULK…)
function:
A FROM clause that is used with SELECT can call OPENROWSET(BULK…)
instead of a table name, with full SELECT functionality.
OPENROWSET with the BULK option requires a correlation name, also known
as a range variable or alias, in the FROM clause. Column aliases can be
specified. If a column alias list is not specified, the format file must
have column names. Specifying column aliases overrides the column names
in the format file, such as:
FROM OPENROWSET(BULK...) AS table_alias
FROM OPENROWSET(BULK...) AS table_alias(column_alias,...n)
A SELECT…FROM OPENROWSET(BULK...) statement queries the data in a file
directly, without importing the data into a table. SELECT…FROM
OPENROWSET(BULK…) statements can also list bulk-column aliases by using
a format file to specify column names, and also data types.
An INSERT...SELECT * FROM OPENROWSET(BULK...) statement bulk imports
data from a data file into a SQL Server table. For more information, see
Importing Bulk Data by Using BULK INSERT or OPENROWSET(BULK...) [

http://msdn2.microsoft.com/en-us/library/ms175915.aspx ] .
When the OPENROWSET BULK option is used with an INSERT statement, the
BULK clause supports table hints. In addition to the regular table
hints, such as TABLOCK, the BULK clause can accept the following
specialized table hints: IGNORE_CONSTRAINTS (ignores only the CHECK and
FOREIGN KEY constraints), IGNORE_TRIGGERS, KEEPDEFAULTS, and
KEEPIDENTITY. For more information, see Table Hint (Transact-SQL) [

http://msdn2.microsoft.com/en-us/library/ms187373.aspx ] .
For information about how to use INSERT...SELECT * FROM
OPENROWSET(BULK...) statements, see Importing and Exporting Bulk Data [

http://msdn2.microsoft.com/en-us/library/ms175937.aspx ] . For
information about when row-insert operations that are performed by bulk
import are logged in the transaction log, see Prerequisites for Minimal
Logging in Bulk Import [

http://msdn2.microsoft.com/en-us/library/ms190422.aspx ] .
Note:
When you use OPENROWSET, it is important to understand how SQL Server
2005 handles impersonation. For information about security
considerations, see Importing Bulk Data by Using BULK INSERT or
OPENROWSET(BULK...) [

http://msdn2.microsoft.com/en-us/library/ms175915.aspx ] .
Bulk Exporting or Importing SQLXML Documents
To bulk export or import SQLXML data, use one of the following data
types in your format file.
Data type
Effect
SQLCHAR or SQLVARYCHAR
The data is sent in the client code page or in the code page implied by
the collation).
SQLNCHAR or SQLNVARCHAR
The data is sent as Unicode.
SQLBINARY or SQLVARYBIN
The data is sent without any conversion.
Permissions

OPENROWSET permissions are determined by the permissions of the user
name that is being passed to the OLE DB provider. To use the BULK option
requires ADMINISTER BULK OPERATIONS permission.
Examples

A. Using OPENROWSET with SELECT and the SQL Native Client OLE DB Provider
The following example uses the SQL Native Client OLE DB provider
(SQLNCLI) to access the HumanResources.Department table in the
AdventureWorks database on the remote server Seattle1. A SELECT
statement is used to define the row set returned. The provider string
contains the Server and Trusted_Connection keywords. These keywords are
recognized by the SQL Native Client OLE DB provider.
Copy Code
SELECT a.*
FROM OPENROWSET('SQLNCLI', 'Server=Seattle1;Trusted_Connection=yes;',
'SELECT GroupName, Name, DepartmentID
FROM AdventureWorks.HumanResources.Department
ORDER BY GroupName, Name') AS a;
B. Using the Microsoft OLE DB Provider for Jet
The following example accesses the Customers table in the Microsoft
Access Northwind database through the Microsoft OLE DB Provider for Jet.
Note:
This example assumes that Access is installed. To run this example, you
must install the Northwind database. For information about how to
install the Northwind database, see Downloading Northwind and pubs
Sample Databases [

http://msdn2.microsoft.com/en-us/library/ms143221.aspx ] .
Copy Code
SELECT CustomerID, CompanyName
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'C:\Program Files\Microsoft Office\OFFICE11\SAMPLES\Northwind.mdb';
'admin';'',Customers)
GO
C. Using OPENROWSET and another table in an INNER JOIN
The following example selects all data from the Customers table from the
local instance of SQL Server Northwind database and from the Orders
table from the Access Northwind database stored on the same computer.
Note:
This example assumes that Access is installed. To run this example, you
must install the Northwind database. For information about how to
install the Northwind database, see Downloading Northwind and pubs
Sample Databases [

http://msdn2.microsoft.com/en-us/library/ms143221.aspx ] .
Copy Code
USE Northwind
GO
SELECT c.*, o.*
FROM Northwind.dbo.Customers AS c
INNER JOIN OPENROWSET('Microsoft.Jet.OLEDB.4.0',
'C:\Program Files\Microsoft
Office\OFFICE11\SAMPLES\Northwind.mdb';'admin';'', Orders)
AS o
ON c.CustomerID = o.CustomerID
GO
D. Using OPENROWSET to bulk insert file data into a varbinary(max) column
The following example creates a small table for demonstration purposes,
and inserts file data from a file named Text1.txt located in the C: root
directory into a varbinary(max) column.
Copy Code
USE AdventureWorks
GO
CREATE TABLE myTable(FileName nvarchar(60),
FileType nvarchar(60), Document varbinary(max))
GO

INSERT INTO myTable(FileName, FileType, Document)
SELECT 'Text1.txt' AS FileName,
'.txt' AS FileType,
* FROM OPENROWSET(BULK N'C:\Text1.txt', SINGLE_BLOB) AS Document
GO
E. Using the OPENROWSET BULK provider with a format file to retrieve
rows from a text file
The following example uses a format file to retrieve rows from a
tab-delimited text file, values.txt that contains the following data:
Copy Code
1 Data Item 1
2 Data Item 2
3 Data Item 3
The format file, values.fmt, describes the columns in values.txt:
Copy Code
9.0
2
1 SQLCHAR 0 10 "\t" 1 ID SQL_Latin1_General_Cp437_BIN
2 SQLCHAR 0 40 "\r\n" 2 Description SQL_Latin1_General_Cp437_BIN
This is the query that retrieves that data:
Copy Code
SELECT a.* FROM OPENROWSET( BULK 'c:\test\values.txt',
FORMATFILE = 'c:\test\values.fmt') AS a;
Additional Examples
For additional examples of using INSERT...SELECT * FROM
OPENROWSET(BULK...), see the following topics:
Examples of Bulk Importing and Exporting XML Documents [

http://msdn2.microsoft.com/en-us/library/ms191184.aspx ]
Keeping Identity Values When Bulk Importing Data [

http://msdn2.microsoft.com/en-us/library/ms186335.aspx ]
Keeping Nulls or Using Default Values During Bulk Import [

http://msdn2.microsoft.com/en-us/library/ms187887.aspx ]
Using a Format File to Bulk Import Data [

http://msdn2.microsoft.com/en-us/library/ms178129.aspx ]
Using Character Format to Import or Export Data [

http://msdn2.microsoft.com/en-us/library/ms190919.aspx ]
Using a Format File to Skip a Table Column [

http://msdn2.microsoft.com/en-us/library/ms179250.aspx ]
Using a Format File to Skip a Data Field [

http://msdn2.microsoft.com/en-us/library/ms187908.aspx ]
Using a Format File to Map Table Columns to Data-File Fields [

http://msdn2.microsoft.com/en-us/library/ms191175.aspx ]
Building Search Applications for the Web Using Microsoft SQL Server 2000
Full-Text Search

Andrew B. Cencini
Microsoft Corporation
December 2002
Applies to:
Microsoft® SQL™ Server 2000
Summary: Learn how to best utilize SQL Server 2000's Full-Text Search.
This article includes several tips and tricks for maximum throughput and
performance. (17 printed pages)
Contents
Introduction
A Little About Full-Text Search
Configuring Full-Text Search
Full-Text Queries
Ranking and Optimizations
Additional Performance Tricks
Conclusion
Appendix A: Implementing Best Bets for Full-Text Searches
Appendix B: A Sample Application Using Best Bets, Results Paging, and
Efficient Full-Text Query Logic
Appendix C: Resources
Introduction
The Full-Text Search feature of Microsoft® SQL™ Server 2000 allows you
to perform fast and flexible queries against indexes built on
unstructured text data. A common use of Full-Text Search is that of the
search engine for web sites. There are a number of concepts and
abstractions that are useful in understanding the best way to utilize
Full-Text Search, as well as several tips and tricks for optimizing your
Full-Text indexes and queries for maximum throughput and performance.
A Little About Full-Text Search
Full-Text Search, as a feature, was introduced in SQL Server 7.0. The
core engine of Full-Text Search is provided by means of Microsoft Search
(MSSearch) technology, which is also used in products such as Microsoft
Exchange, and Microsoft SharePoint™ Portal Server.
The functionality exposed in SQL Server 7.0 Full-Text Search provides
basic text search abilities, and uses an earlier version of MSSearch.
SQL Server 2000's Full-Text Search implementation delivers a robust set
of index and query features, and several enhancements in addition to
those included with SQL Server 7.0. Among those enhancements are: full
support for clustering via Microsoft Clustering Service, the ability to
filter and index documents stored in IMAGE columns, improved language
support, and performance, scalability and reliability improvements.
MSSearch builds, maintains and queries Full-Text indexes stored in the
file system (as opposed to inside of SQL Server). The logical and
physical storage unit used for Full-Text indexes by MSSearch is a
catalog. A Full-Text catalog contains one or more Full-Text indexes per
database—one Full-Text index may be created per table in SQL Server, and
you may include one or more columns from that table in the index. Each
table may belong to only one catalog, and only one index may be created
on each table. We'll get into best practices in terms of organizing your
Full-Text catalogs and indexes in a little bit—but first, a little more
on how Full-Text Search works.
Configuring Full-Text Search
To create a Full-Text index on your text data stored in SQL Server,
there are a few steps you should take to get ready. The first step is to
Full-Text enable the database that contains textual data you wish to
index (if you have not done so already).
Caution Executing the following statement will drop and re-create all
Full-Text catalogs belonging to the database on which you wish to enable
Full-Text Search. Be sure you haven't already created any Full-Text
catalogs in the particular database you are enabling unless you want
them rebuilt.
Provided you are a member of the sysadmin role or db_owner for that
particular database, you can go ahead and issue the following statement:
use Northwind
exec sp_fulltext_database 'enable'
Next, you will want to create a Full-Text catalog to store your
Full-Text indexes. As I said before, the data in this catalog lives in
your file system as opposed to within SQL Server, and therefore you
should choose carefully when you consider where your Full-Text catalog
will be stored. Unless you specify otherwise, your Full-Text catalog
will be stored in a subdirectory of the FTDATA directory that resides in
your Microsoft SQL Server\MSSQL storage location. Here is how you may go
about creating a Full-Text Catalog in a non-default location:
exec sp_fulltext_catalog 'Cat_Desc', 'create', 'f:\ft'
In this case, your Full-Text catalog will be created as a subdirectory
of 'f:\ft', and if you take a look at that part of your file system, you
will see it there as its own directory. The naming convention used for
Full-Text catalogs by MSSearch is:
SQL+dbid+catalogID
Catalog IDs start at 00005 and are incremented by one for each new
catalog created.
As a best practice, if possible, Full-Text catalogs should be created on
their own physical drive (or drives). Given the process of building a
Full-Text index is fairly I/O intensive (on a high level, it consists of
reading data from SQL Server, and then writing the index to the file
system), you probably want to avoid letting your I/O subsystem become a
bottleneck.
So, how big are your Full-Text catalogs? Generally, Full-Text catalogs
add about 30% overhead for the amount of data stored in SQL Server that
you are Full-Text indexing. This rule of thumb, however, is dependent on
the distribution of unique words (or keys) in your data, as well as
which words you consider noise words Noise words, or stop words, are
terms that are excluded from Full-Text indexes and queries because they
are not 'interesting' search terms that have a high rate of occurrence,
and only bloat your indexes. Later, we will get into some considerations
in terms of noise word selection, and how you can possibly tune your
noise words to improve your query performance.
If you have not done so already, create a unique, single-column,
non-nullable index on each table upon which you plan to build a
Full-Text index. This unique index is used to map each row in your table
to a unique, compressible key used internally by MSSearch. Next, you
want to let MSSearch know that you wish to create a Full-Text index on
your tables. Issuing the following statement for your table will add it
to a Full-Text catalog of your choice (in this case, 'Cat_Desc', which
we created above):
exec sp_fulltext_table 'Categories', 'create', 'Cat_Desc',
'PK_Categories'
Adding columns to this Full-Text index is the next step. For each
column, you may optionally choose a language, and if the column is of
type IMAGE, you must specify another column that will be used to
indicate what type of document is stored in each row of your IMAGE column.
There are some important—and not well-documented—considerations for
choosing the column language. These considerations relate to how your
text is tokenized and then indexed by MSSearch. Text being indexed is
fed through a component called a wordbreaker that tokenizes on word
boundaries. These word boundaries, in the English language, are
typically whitespace or some form of punctuation. In other languages,
such as German, words or characters may be combined together; therefore,
your choice of a column-level language should represent the language you
expect will be stored in rows of that column. If you are unsure, a
general best bet is to use the neutral wordbreaker, which performs its
tokenization purely on whitespace and punctuation. An additional benefit
of your column-level language choice is "stemming". Stemming in
Full-Text queries is defined as the process of searching for all stemmed
(inflectional) forms of a word in a particular language.
Another consideration in language choice is related to the way in which
your data is represented. For non-IMAGE column data, no special
filtering is performed. Rather, the text is generally passed through the
wordbreaking component as-is. Wordbreakers are designed mainly to
process written text. So, if you have any type of markup (such as HTML)
on your text, you may not get great linguistic accuracy during indexing
and search. In that case, you have two choices—the preferred method is
simply to store the text data in an IMAGE column, and to indicate its
document type so it may be filtered. If this is not an option, you may
consider using the neutral wordbreaker and, if possible, adding markup
data (such as 'br' in HTML) to your noise word lists. You will not
receive the benefit of any language-based stemming on a column with the
neutral language specified, but certain circumstances may dictate this
choice.
Now that you know what your column-level options are, try adding a
column or two to your Full-Text index, by issuing the following:
exec sp_fulltext_column 'Categories', 'Description', 'add'
You may notice I did not specify any language here—in this case, the
default Full-Text language will be used. You may set the default
Full-Text language for your server via the system stored procedure
"sp_configure".
With all of your columns added to your Full-Text index, you're now ready
to start a population. There is much that can be said about the various
options you have when it comes to population methods, and I'll avoid
going into too much detail—but for this example, you should simply start
a full population on your table, and wait for it to complete:
exec sp_fulltext_table 'Categories', 'start_full'
You may wish to monitor your population status using either the
FULLTEXTCATALOGPROPERTY or OBJECTPROPERTY functions. To get your catalog
population status, you can execute:
select FULLTEXTCATALOGPROPERTY('Cat_Desc', 'Populatestatus')
Typically, if a full population is in progress, the result returned is
'1'. For more details on how to use FULLTEXTCATALOGPROPERTY and
OBJECTPROPERTY, see SQL Server Books Online.
Full-Text Queries
Querying Full-Text indexes is slightly different than executing standard
relational queries in SQL Server. Since your indexes are stored and
managed external to SQL Server, Full-Text query processing is in large
part handled by MSSearch—therefore, queries that are partially
relational in nature, and partially Full-Text based will be processed
separately—which can sometimes hurt performance.
Essentially, when you execute a Full-Text query, the query terms are
passed to MSSearch, which traverses its internal data structures (the
indexes), and it returns a key and rank value to SQL Server. You
generally do not see the key or rank values when you execute a CONTAINS
or FREETEXT query; however, if you execute a CONTAINSTABLE or
FREETEXTTABLE query, you are provided with these values—which are
typically then joined against the base table. The process of joining
keys against the base table can be quite expensive—shortly, we'll
present some clever ways to minimize or completely avoid this join.
If you've been thinking ahead, and know a little bit about how Full-Text
queries return data, you may surmise that CONTAINS/FREETEXT queries are
simply performing a CONTAINSTABLE/FREETEXTTABLE query and joining
against the base table. Your understanding should lead you to avoid
using those types of queries unless the cost of not doing so would be
greater. In the case of web search applications, using CONTAINSTABLE and
FREETEXTTABLE is much better than using the TABLE-less cousins.
So far, you know that Full-Text queries are special ways of accessing
data from MSSearch indexes stored outside of SQL Server, and that you
can get in a bit of trouble by blindly joining against your base table.
Another key thing to know is the actual difference between
CONTAINS-style queries and FREETEXT-style queries.
CONTAINS queries perform an exact match for all terms you are searching
for. Whether you are simply looking for a single word, or all words
beginning with 'orange', you will only be returned results that contain
all of your search terms. Along these lines, CONTAINS queries are quite
fast, as they typically return fewer results, and generally don't need
to perform too much additional processing. Some downsides of CONTAINS
queries include the pesky problem of noise word filtering. Experienced
developers and DBAs who have worked with Full-Text Search in the past
have encountered the dreaded "Your query contains only noise words"
error when trying to match words or phrases that include even a single
noise word. One way to avoid receiving this error is to filter out noise
words prior to executing Full-Text queries. It is not possible to return
results to a CONTAINS query containing noise words, as queries of this
type are to return exact matches to your entire query string. Since
noise words are not Full-Text indexed, no rows may be returned on a
CONTAINS query that includes noise words.
FREETEXT queries overcome all of the caveats that occasionally occur in
CONTAINS queries. When you issue a FREETEXT query, you are essentially
issuing a stemmed any-words query. Therefore, when you search for "root
beer", both 'root' and 'beer' are stemmed out to all of their forms
(stemming is language specific; the language used is determined by the
Full-Text column language specified at indexing time, and must be the
same across all the queried columns), and any row that matches at least
one of those terms will be returned.
The side effect of FREETEXT queries is that they have a tendency to use
more CPU than CONTAINS queries—the stemming and larger possible set of
returned results, combined with a more complicated calculation of rank
are the culprits. Nevertheless, FREETEXT-based queries are amazingly
flexible, still extremely fast, and quite often the best way to go for
web-based search applications.
Ranking and Optimizations
I often meet with users of Full-Text Search, and they ask me what the
ranking numbers mean, and how they can translate them to some sort of
human-understandable value. There is a short answer and a long answer
that can be given to this question, but I'll stick to the short one to
be brief. Basically, those ranking numbers are not as significant as the
order in which the results are returned. What this means is that when
you order your results by rank, you are always returning the most
relevant results first. The rank values themselves are prone to
change—Full-Text Search uses a probabilistic ranking algorithm, which
means the relevance of each document you are returning is directly
effected by any and all other documents in your Full-Text index.
One trick some people think helps increase the rank of certain rows is
to repeat commonly used search keywords in the Full-Text indexed columns
of those rows. While this may, to a certain degree, help improve the
chances of those rows being returned first for certain keywords, it may
backfire in other cases—and also puts you at a moderate risk of hurting
query performance for those terms. A better solution would be to
implement a "Best Bets" system for your search application (see example
below) so you may be guaranteed that certain documents are returned
first. The problem with extensive duplication of keywords is that it can
bloat your Full-Text indexes for those specific keywords, and cause
MSSearch to spend more time than necessary in finding the right rows and
calculating rank. If you have a huge amount of Full-Text indexed data,
and have tried this tactic, you probably will find that some Full-Text
queries take quite a while. If you are able to implement a more lean
(and probably more accurate) "Best Bets" system, you are likely to find
it will make a world of difference in your query performance.
Another problem related to extensive duplication of data is related to a
commonly used trick to combine relational and Full-Text queries. This
problem plagues many people using Full-Text Search and is encountered
when one attempts to apply some sort of a filter to results returned
from a Full-Text query. As I said, Full-Text queries return a key and a
rank for each matching row—to garner any more information about those
rows, one must perform a join against its base table. Since any number
of results may possibly be returned from an unrestricted Full-Text
query, that join may become quite costly. One clever way people have
found to avoid that join is to simply add the data to be filtered (if
possible) to their Full-Text index. In other words, if someone wants to
search on the keyword of "Ichiro" from the body text of all articles in
a newspaper, but only return articles that were in the sports section of
the newspaper, queries are typically expressed similar to:
-- [APPROACH 1:]
-- most expensive: select all, then join and filter
SELECT ARTICLES_TBL.Author, ARTICLES_TBL.Body, ARTICLES_TBL.Dateline,
FT_TBL.[rank]
FROM FREETEXTTABLE(Articles, Body, 'Ichiro') AS FT_TBL
INNER JOIN Articles AS ARTICLES_TBL
ON FT_TBL.[key] = ARTICLES_TBL.ArticleID
WHERE ARTICLES_TBL.Category = 'Sports'

-- [APPROACH 2:]
-- works, but can backfire and become slow or return inaccurate results:
-- perform filtering via Full-Text and only extract key and rank
-- (processing done at web server level)
SELECT [key], [rank]
FROM CONTAINSTABLE(Articles, *, 'FORMSOF(INFLECTIONAL('Ichiro')
AND "sports"')
The problem with these queries is that they are either unnecessarily
expensive, or run the risk of returning wrong results (in the second
query, 'sports' is quite likely to occur in articles of all categories).
There are other permutations of these techniques, but these are two very
simple mockups. A suggestion I typically give, if it can be afforded, is
that of some form of horizontal partitioning of data. In other words,
each possible value for your 'categories' column could simply become its
own column (or table), and searchable keywords relating to that article
would be stored in only that column. Taking this approach, rather than
having one single 'Body' column and a 'Category' column, you could get
rid of the 'Category' column, and have a 'Body_<category>' column that
would store the searchable keywords. An example below:
-- If you can adjust your schema this works great – each category
-- becomes its own column (or table), and you only hit that
-- smaller Full-Text index. There are obviously some caveats…
SELECT [key], [rank]
FROM FREETEXTTABLE(Articles, Body_Sports, 'Ichiro')
For those systems with a large amount of data that can accommodate this
(perhaps major) schema change, a significant and appreciable performance
improvement should become immediately apparent. There are obvious
limitations when it comes to applying multiple or no filters; there are
certainly other ways to work around those problems. From the example
above, you can get a picture of one way to abstract some of your search
conditions into your schema—in essence 'cheating' the optimizer (more
appropriately, 'becoming' the optimizer) as there is little to no native
optimization currently possible for Full-Text queries within SQL Server
itself.
Additional Performance Tricks
Another request I commonly get from people I speak with is that of the
ability to page through Full-Text query results. In other words, if I
were to issue a query for "root beer", with the results to be displayed
on a web page 40 at a time, I'd want to only return the 40 results for
that particular page (if I were on page three, for example, I'd want to
return only results 81-120).
There are several approaches I have seen when it comes to paging through
results, but none of them have been 100% effective in terms of being
very efficient. The approach I suggest lets you minimize the number of
Full-Text queries you execute (in effect, only one per set of results to
be paged through), and use your web server as a simple cache. On a
high-level, what you should do is retrieve one complete rowset of keys
and rank values for your Full-Text query (you can fold in Best Bets and
abstract common filters to your schema if you desire), and store them in
memory on your web server (depending on your application and load,
imagine a typical key size of <32 bytes plus a rank size of <4 bytes =
<36 bytes multiplied by a typical returned result set <1000 rows is
<35K. Assume an actively cached set of <1000 active query result sets at
any given time, and you will find that it occupies less than 35MB of RAM
on the web server—not too shabby).
In order to page through the results, the process simply becomes
traversing an array stored in memory on your web server and issuing a
SELECT against your SQL Server for only the rows and columns you wish to
display. This also gets back to the concept of only returning keys and
ranks for Full-Text queries—a SELECT (even many of them) is many times
faster than a Full-Text query. By using SELECT as opposed to joining
many rows against the base table, combined with several other tactics,
you will be able to reserve more CPU cycles on your SQL Server machines,
and get more use of your (less expensive) web farm.
An alternate approach to web server-side caching is to cache result sets
in SQL Server itself, and to define various methods for navigating
through those results. Though this article focuses primarily on
application design at the web server (ASP) level, the programmability
features of SQL Server also provide a rich framework for building
high-performance search applications for the web.

DotNet First Side

Wednesday, February 13, 2008

Querying SQL Server Using Full-Text Search

No comments:

Post a Comment

Dotnet-Interviews

Blog Archive

Blog Roll