MATCH (col1,col2,...) AGAINST (expr[search_modifier])search_modifier:{ IN BOOLEAN MODE | WITH QUERY EXPANSION }
As of MySQL 3.23.23, MySQL has support for full-text indexing and searching:
A full-text index in MySQL is an index of type
FULLTEXT.
Full-text indexes can be used only with
MyISAM tables, and can be created only for
CHAR, VARCHAR, or
TEXT columns.
A FULLTEXT index definition can be given in
the CREATE TABLE statement when a table is
created, or added later using ALTER TABLE
or CREATE INDEX.
For large datasets, it is much faster to load your data into a
table that has no FULLTEXT index and then
create the index after that, than to load data into a table
that has an existing FULLTEXT index.
Full-text searching is performed using MATCH() ...
AGAINST syntax. MATCH() takes a
comma-separated list that names the columns to be searched.
AGAINST takes a string to search for, and an
optional modifier that indicates what type of search to perform.
The search string must be a literal string, not a variable or a
column name. There are three types of full-text searches:
A boolean search interprets the search string using the rules
of a special query language. The string contains the words to
search for. It can also contain operators that specify
requirements such that a word must be present or absent in
matching rows, or that it should be weighted higher or lower
than usual. Common words such as “some” or
“then” are stopwords and do not match if present
in the search string. The IN BOOLEAN MODE
modifier specifies a boolean search. For more information, see
Section 12.8.1, “Boolean Full-Text Searches”.
A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in more than 50% of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.
A query expansion search is a modification of a natural
language search. The search string is used to perform a
natural language search. Then words from the most relevant
rows returned by the search are added to the search string and
the search is done again. The query returns the rows from the
second search. The WITH QUERY EXPANSION
modifier specifies a query expansion search. For more
information, see Section 12.8.2, “Full-Text Searches with Query Expansion”.
Constraints on full-text searching are listed in Section 12.8.4, “Full-Text Restrictions”.
mysql>CREATE TABLE articles (->id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,->title VARCHAR(200),->body TEXT,->FULLTEXT (title,body)->);Query OK, 0 rows affected (0.00 sec) mysql>INSERT INTO articles (title,body) VALUES->('MySQL Tutorial','DBMS stands for DataBase ...'),->('How To Use MySQL Well','After you went through a ...'),->('Optimizing MySQL','In this tutorial we will show ...'),->('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),->('MySQL vs. YourSQL','In the following database comparison ...'),->('MySQL Security','When configured properly, MySQL ...');Query OK, 6 rows affected (0.00 sec) Records: 6 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM articles->WHERE MATCH (title,body) AGAINST ('database');+----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 1 | MySQL Tutorial | DBMS stands for DataBase ... | +----+-------------------+------------------------------------------+ 2 rows in set (0.00 sec)
The MATCH() function performs a natural
language search for a string against a text
collection. A collection is a set of one or more
columns included in a FULLTEXT index. The
search string is given as the argument to
AGAINST(). For each row in the table,
MATCH() returns a relevance value; that is, a
similarity measure between the search string and the text in that
row in the columns named in the MATCH() list.
By default, the search is performed in case-insensitive fashion.
In MySQL 4.1 and up, you can make a full-text search by using a
binary collation for the indexed columns. For example, a column
that has a character set of latin1 character
set of can be assigned a collation of
latin1_bin to make it case sensitive for
full-text searches.
When MATCH() is used in a
WHERE clause, as in the example shown earlier,
the rows returned are automatically sorted with the highest
relevance first. Relevance values are non-negative floating-point
numbers. Zero relevance means no similarity. Relevance is computed
based on the number of words in the row, the number of unique
words in that row, the total number of words in the collection,
and the number of documents (rows) that contain a particular word.
To simply count matches, you could use a query like this:
mysql>SELECT COUNT(*) FROM articles->WHERE MATCH (title,body)->AGAINST ('database');+----------+ | COUNT(*) | +----------+ | 2 | +----------+ 1 row in set (0.00 sec)
However, you might find it quicker to rewrite the query as follows:
mysql>SELECT->COUNT(IF(MATCH (title,body) AGAINST ('database'), 1, NULL))->AS count->FROM articles;+-------+ | count | +-------+ | 2 | +-------+ 1 row in set (0.00 sec)
The first query sorts the results by relevance whereas the second does not. However, the second query performs a full table scan and the first does not. The first may be faster if the search matches few rows; otherwise, the second may be faster because it would read many rows anyway.
For natural-language full-text searches, it is a requirement that
the columns named in the MATCH() function be
the same columns included in some FULLTEXT
index in your table. For the preceding query, note that the
columns named in the MATCH() function
(title and body) are the
same as those named in the definition of the
article table's FULLTEXT
index. If you wanted to search the title or
body separately, you would need to create
separate FULLTEXT indexes for each column.
It is also possible to perform a boolean search or a search with query expansion. These search types are described in Section 12.8.1, “Boolean Full-Text Searches”, and Section 12.8.2, “Full-Text Searches with Query Expansion”.
A full-text search that uses an index can name columns only from a
single table in the MATCH() clause because an
index cannot span multiple tables. A boolean search can be done in
the absence of an index (albeit more slowly), in which case it is
possible to name columns from multiple tables.
The preceding example is a basic illustration that shows how to
use the MATCH() function where rows are
returned in order of decreasing relevance. The next example shows
how to retrieve the relevance values explicitly. Returned rows are
not ordered because the SELECT statement
includes neither WHERE nor ORDER
BY clauses:
mysql>SELECT id, MATCH (title,body) AGAINST ('Tutorial')->FROM articles;+----+-----------------------------------------+ | id | MATCH (title,body) AGAINST ('Tutorial') | +----+-----------------------------------------+ | 1 | 0.65545833110809 | | 2 | 0 | | 3 | 0.66266459226608 | | 4 | 0 | | 5 | 0 | | 6 | 0 | +----+-----------------------------------------+ 6 rows in set (0.00 sec)
The following example is more complex. The query returns the
relevance values and it also sorts the rows in order of decreasing
relevance. To achieve this result, you should specify
MATCH() twice: once in the
SELECT list and once in the
WHERE clause. This causes no additional
overhead, because the MySQL optimizer notices that the two
MATCH() calls are identical and invokes the
full-text search code only once.
mysql>SELECT id, body, MATCH (title,body) AGAINST->('Security implications of running MySQL as root') AS score->FROM articles WHERE MATCH (title,body) AGAINST->('Security implications of running MySQL as root');+----+-------------------------------------+-----------------+ | id | body | score | +----+-------------------------------------+-----------------+ | 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 | | 6 | When configured properly, MySQL ... | 1.3114095926285 | +----+-------------------------------------+-----------------+ 2 rows in set (0.00 sec)
The MySQL FULLTEXT implementation regards any
sequence of true word characters (letters, digits, and
underscores) as a word. That sequence may also contain apostrophes
(‘'’), but not more than one in a
row. This means that aaa'bbb is regarded as one
word, but aaa''bbb is regarded as two words.
Apostrophes at the beginning or the end of a word are stripped by
the FULLTEXT parser;
'aaa'bbb' would be parsed as
aaa'bbb.
The FULLTEXT parser determines where words
start and end by looking for certain delimiter characters; for
example, ‘ ’ (space),
‘,’ (comma), and
‘.’ (period). If words are not
separated by delimiters (as in, for example, Chinese), the
FULLTEXT parser cannot determine where a word
begins or ends. To be able to add words or other indexed terms in
such languages to a FULLTEXT index, you must
preprocess them so that they are separated by some arbitrary
delimiter such as ‘"’.
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is four characters.
Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overwritten by a user-defined list.
The default stopword list is given in Section 12.8.3, “Full-Text Stopwords”. The default minimum word length and stopword list can be changed as described in Section 12.8.5, “Fine-Tuning MySQL Full-Text Search”.
Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row.
Such a technique works best with large collections (in fact, it
was carefully tuned this way). For very small tables, word
distribution does not adequately reflect their semantic value, and
this model may sometimes produce bizarre results. For example,
although the word “MySQL” is present in every row of
the articles table shown earlier, a search for
the word produces no results:
mysql>SELECT * FROM articles->WHERE MATCH (title,body) AGAINST ('MySQL');Empty set (0.00 sec)
The search result is empty because the word “MySQL” is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large datasets, this is the most desirable behavior: A natural language query should not return every second row from a 1GB table. For small datasets, it may be less desirable.
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular dataset in which they occur. A given word may exceed the 50% threshold in one dataset but not another.
The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the boolean search mode; see Section 12.8.1, “Boolean Full-Text Searches”.

User Comments
Hyphen '-' characters break literals at the moment.
found with full text searches.A search for something like "GATA-D22S690" finds
all entries containing GATA and not the full
hyphenated text. The '-' character is treated as a
word stop even within literals. The same is true if
any of the special text search modifiers are used (eg
It should be noted in the documentation that IN
BOOLEAN MODE will almost always return a
relevance of 1.0. In order to get a relevance that is
meaningful, you'll need to:
<BR/><BR/>
SELECT MATCH('Content') AGAINST ('keyword1
keyword2') as Relevance FROM table WHERE MATCH
('Content') AGAINST('+keyword1 +keyword2' IN
BOOLEAN MODE) HAVING Relevance > 0.2 ORDER
BY Relevance DESC
<BR/><BR/>
Notice that you are doing a regular relevance query
to obtain relevance factors combined with a WHERE
clause that uses BOOLEAN MODE. The BOOLEAN
MODE gives you the subset that fulfills the
requirements of the BOOLEAN search, the relevance
query fulfills the relevance factor, and the HAVING
clause (in this case) ensures that the document is
relevant to the search (i.e. documents that score
less than 0.2 are considered irrelevant). This also
allows you to order by relevance.
<BR/><BR/>
This may or may not be a bug in the way that IN
BOOLEAN MODE operates, although the comments
I've read on the mailing list suggest that IN
BOOLEAN MODE's relevance ranking is not very
complicated, thus lending itself poorly for actually
providing relevant documents. BTW - I didn't notice
a performance loss for doing this, since it appears
MySQL only performs the FULLTEXT search once,
even though the two MATCH clauses are different.
Use EXPLAIN to prove this.
An easy solution to correct for spelling errors for small search items like the name of the city is to build a column that contains the SOUNDEX of each. I've found that using a 4 character SOUNDEX works the best. An example:
ALTER TABLE cities ADD city_soundex VARCHAR(4) NOT NULL;
UPDATE cities SET city_soundex=LEFT(SOUNDEX(city_name),4);
And then to query against:
SELECT * FROM citites WHERE city_soundex=LEFT(SOUNDEX('Some City Name'),4);
The full-text search is slow when there are a lot of rows in the table. I have more than 2 million rows with text and multiple word searches (3 or more) take about 30 seconds to a minute or longer.
I am running a Athlon 2.2 Ghz with 512 MB DDR RAM 400 Mhz. Hard drive has seek time of 9 ms.
We too have a database with close to 6 million rows in it. We would like to use the fulltext search, but it is painfully slow. The sad thing is that any one query we need to run only needs to be run on a subset of the rows (think of those 6 million rows as being divided between about 80 different categories, with results only needed to be returned from within a category). It is a shame that there isn't some way for us to have a fulltext index within the rows of another column which is also indexed.
Our next shot at this is going to be to create 80 different tables, one for each category (yuck!), just to try to get decent performance on the within-category fulltext search.
I would think there is enough interest in fulltext search for there to be an email list dedicated to it where those of us who need to use it on real-world web sites could interact with the developers to try to tune it for the good of the community at large.
The nature of fulltext indexing is such that the more 'words' (of greater than the minimum length) that appear in the columns you index, the greater will be size of the index, and the time it takes to create and search that index.
4 Million rows x 6 words per row = 24 Million entries
4 Million rows x 30 words per row = 120 Million entries
If you index intelligently, you may well find that the feature meets your needs, despite what some users may have remarked above. "Number of rows" is useless as a benchmarking statistic on its own.
I've got a webshop running with mysql. We have 1.6 million booktitles in our database, and our fulltext index is from title and 'extra' title. Maybe around 5-7 words on average per record.
Any fulltext search in our database takes around 1-5ms. A three word search can take a few seconds but still faster than anything else...Even Amazon.
This is from our main server which has several thousand hits every day. We are one of the biggest booksellers in my country.
-This was in 'reply' to someone posting earlier about poor performance in 'real world situation'.
BENCHMARK INFO!
We have a vast database of products (totaling 3 million rows).
The rebuild of the full-text index took 21 minutes to complete.
As for the search, a multiple word search on the full-text index of 1 field is as follows:
20 rows returned
Query took 1.0263 sec
The size of the index is very big though.
Index 449,529 KB
We are going to eliminate unnecessary words which should make the index smaller.
The indexing process definitely depends on your machine config. I indexed the same dataset, same mysql version (4.0.16, -standard on Linux, -nt on XP).
First machine: a 512M Athlon 2100 running WinXP on a RAID1 array, and it took 9 mins 20 sec.
Second machine: 1G RAM P4-2.0ghz, running Linux 7.3 (2.4.20-13.7 kernel), one-spindle (but fast) IDE disk, and it took 29 mins 11 sec.
The query was "alter table msg_bodies add fulltext fti_bodies (subject, body);", and the dataset is about 550M, 1.3 million rows.
Both machines were mostly idle when the indexing occurred. I know it's just anecdotal, but if you run a big alter as I did, don't be surprised if it takes a while... oh yeah, and then be prepared for MUCH longer than normal insert times when you add large data to the FT indexed columns. This makes sense of course, but, at least on my dataset and hardware, is positively NOT neglible. FYI.
Full text searching in 4.1.1
For Chinese charset in UTF-8 encoding.
It needed to set ft_min_word_len =1
(MySQL 4.1)
To make FULLTEXT MATCH work with Japanese UTF-8 text, be careful that words from your Japanese text be separated by the __ASCII__ space character, not Japanese UTF-8 (or other) spacing characters.
(when using phpMyAdmin to manage data/ write a SQL query, you must switch away from your Japanese IME to insert a space char...)
Tom's fulltext tips: To get MySQL searching well for me I did:
1. Have a normalized versions of the important columns: where you've stripped punctuation and converted numerals to words ('1' to 'one'). Likewise normalise the search string.
2. Have a combined fulltext index on your searchable columns to use in your 'WHERE' clause, but then have separate fulltext indexes on each column to use in the 'ORDER BY' clause, so they can have different weights.
3. For the scoring algorithm, include the independent importance of that record, and include a match of the inclusive index against stemmed versions of the search words (as: "wedding" => "wed", "weddings").
4. If there's exactly one result, go straight to that record.
5. If you get no results, try matching against the start of the most important column (WHERE column LIKE 'term%'), and put a 5-character index on that column. This helps if someone is searching on a very short word or a stopword.
6. Reduce minimum word length to 3, and make a new stopwords list just using "a an and the is in which we you to on this by of with". Use "REPAIR TABLE xxx QUICK" to rebuild the index and make a note of the index-file (xxx.MYI) size before and after you make changes. Then use ft_dump to tune.
There's an interesting and helpful set of slides about fulltext search by Sergei Golubchik (the developer of this stuff) at http://www.phpconference.de/2003/slides/database_track/golubchik_mysql_fulltext_search_2003.pdf
It throws some light on quite a few of the questions posed here.
Cheers
James
More about performance:
Fulltext search in MySQL isn't slow really. It's slower than normal index selects, but that's not noticable.
I'm playing tables that each contains ca 4million records, about 6GB of text that needs to be indexed. The problem lies in the optimization of the queries, that use the MATCH() and AGAINST() functions. So far as I found out MySQL can use in a query only one index. The optimizer looks for the index that will possibly give the smallest amount of rows, then goes trough all of these, and removes those records, that fall out because of the other WHERE statements.
When you enter a fulltext search, the server _has_ to use that index. Any other statments could be applied only to the rows that were returned by the search. If you have lots of records, most of the searches will return lots of results. The rest of your query will be executed like if you were executing it against a table that contains the results, and no indexes - obviously that will be processed sequentially.
Same thing applies when you use the SQL_CALC_FOUND_ROWS, or LIMIT with high offset values: first the server has to load all the results, then the limit could be applied.
If you _have_ to use the fulltext, these are some hints how you could get quite good results:
- Try to avoid any group, where, order and any statements for what it's necessary to get all the results. I know, usually this is impossible.
- If you need to show the results on a web page, try to find other methods to get the total number of results than using SQL_CALC_FOUND_ROWS. Better if you don't tell the user at all how many results there are ;)
- If indexing text that is in other language than english, before doing anything create a stopword file for your language! (That could reduce index size about 30%)
But most important: think a lot, before you decide to use fulltext search!
We are doing fulltext searches of around 200 tables of 1/4 -1/2 million rows each. Upgrading a twin cpu 2Ghz linux machine (running ES 3) from 1GB to 3GB RAM and increasing key_buffer_size from 384MB to 512MB has seen load averages go from 1.5-2 to around 0.5, with same usage.
Hi Attilla:
I'm not sure I agree with your comment that Mysql *always* uses a fulltext index if you include match...against.
I tried a sample query where I used match...against and another value in the where clause (on the primary keyed field) and Mysql's optimizer picked the primary key index.
The statement looked like "select something from sometable where match(col1, col2) against ('some text here') and pk_col = 44
http://dev.mysql.com/doc/mysql/en/Fulltext_Boolean.html states you cannot do a full-text search in Boolean Mode by relevancy. You can, but you need mysql version 4+. Take note of "ORDER BY relevancy DESC".
Here is my code/example:
SELECT *,MATCH (YR,AU,ST,SD,SC,BT,BD,BC,AT,AD,AC,JR,KW,AUS,GEO,AN,RB,CO,RR) AGAINST ('Margaret Atwood' IN BOOLEAN MODE) AS relevancy FROM
ccl_main WHERE MATCH (YR,AU,ST,SD,SC,BT,BD,BC,AT,AD,AC,JR,KW,AUS,GEO,AN,RB,CO,RR) AGAINST ('Margaret Atwood' IN BOOLEAN MODE) ORDER BY
relevancy DESC;]
See The Full-Text Stuff That We Didn't Put In The Manual at http://dev.mysql.com/tech-resources/articles/full-text-revealed.html for more information about full-text search.
Thought some benchmarks for a very large data set might be useful to people. I created a full-text index on a 100 gig database, where the index was on a column where the data totals 3 gig. The index added 2 gig to the database. The database has 2,741,000 records. The textual content is technical literature in US English.
Machine: The machine is an Athlon XP 3200+ w/ 1 gig RAM, one drive for the OS, and another 250GB EIDE drive for the MySQL data. OS is Redhat 9.
Index creation: 3 hours 5 min. Most of that time was probably copying the table (since MySQL copies the table before it modifies it). The process was definitely disk-limited -- CPU usage was never regularly above 50%. I believe that were it not for the excess data to copy, the index creation would have taken about 45 minutes.
Queries were chosen to compare rare words, common words, and words of different lengths. All queries seemed to be disk-limited (CPU utilization hardly moved off baseline), which makes sense given that there is not enough RAM to hold the index in memory. All queries were done mutiple times and the average time reported. Results follow.
Word, Rows, Seconds, Seconds/Row
----------------------------------------
bilobed, 4, 0.15, 0.0375
mends, 4, 0.19, 0.0475
nanotechnology, 23, 0.64, 0.0278
bioluminescent, 53, 1.53, 0.0289
photosynthesis, 81, 2.21, 0.0273
graphite, 5070, 123.00, 0.0243
bicycle, 5385, 122.00, 0.0227
titanium, 13503, 350.00, 0.0259
(titanium,graphite), 18423, 425.00, 0.0231
metal, 151095, 4020.00, 0.0266
This is just a small test on the way to indexing the full 100 gig in the database that I am working on for http://www.freepatentsonline.com. I'll post results for that once I have a new server built.
In reply to Patrick O'Lone's post, above:
The first, non-boolean MATCH can't keep pace with the second, boolean one since it does not recognise the boolean operators. A search for foo* might turn up rows with foo, foobar, foofighters, etc but the non-boolean, relevance MATCH can't "count" anything except foo. Same problem with a phrase search.
Since effectively you can't use boolean operators in the second, boolean MATCH, it's rendered pointless.
Results could be post-processed with your own ranking algorithm but it's kind of odd that you can't do a boolean search AND rank results in the query.
Those looking to simply match a search phrase in a large group of words, try this:
SELECT * FROM data WHERE
haystack LIKE ('%needle%')
AND haystack RLIKE '[[:<:]]needle[[:>:]]'
This query will produce the same result as the following query, but is roughly 20X faster.
SELECT * FROM data WHERE
haystack RLIKE '[[:<:]]needle[[:>:]]'
For more than one word use:
haystack LIKE ('%word1%word2%')
AND haystack RLIKE '[[:<:]]word1[[:>:]].*[[:<:]]word2[[:>:]]'
The Wikipedia encyclopedia uses MySQL boolean full text search and MySQL 4.0.20. You can assess the speed yourselves. About 350,000 articles/ rows in the English language version, roughtly 5GB total. For all languages, one million articles and 10GB of text.
It's very important to have a sufficiently large key_buffer_size. Get much of the index in RAM and searches are typically very fast. Because we use InnoDB tables and can't share cache between InnoDB and MyISAM, we're moving to a setup which will use a dedicated and MyISAM tuned search server.
We run a slow query killer which will kill queries once their allowed time expires. That time depends on server load (active thread count). Documented at http://wp.wikidev.net/Querybane
It turns out that our current search often spends most time on things other than full text search - an improved version which is much more efficient is pending. Remember that you can use a self join to get around the one index per query limit of MySQL.
Search remains our most problematic load feature, requiring a couple of quite capable slaves to keep up at busy time. If we're working on the servers and the remainder can't handle the load, we switch to Google or Yahoo search. This is from one of the top few hundred sites on the net, so it's scaling pretty well for our application. One server was sufficiently fast to take us to the top 1,000.
Hi all! I had a problem with FULLTEXT search and after I solved it, I want to try to help you. I thought that FULLTEXT search is slow. That was. But I did a simple trick:
1. For example, we have 4 tables with fulltext index. We need to perform fast search on them.
2. Do the following:
CREATE TEMPORARY TABLE xxx SELECT id, name, MATCH(name) AGAINST ('search_string') AS relevancy FROM table1;
INSERT INTO xxx SELECT id, name, MATCH(name) AGAINST ('search_string') AS relevancy FROM table2 ...
3. Then, when the temporary table is filled with the data, do the simple select from it:
SELECT id, name FROM xxx ORDER BY relevancy DESC
4. That's all.
I think, it is the optimal way to make a VERY FAST fulltext search from a number of tables. I hope, this will be helpful.
Allthough my fulltext serach works fine, i still have problems with it.
i compare products with there prices. i used the "relevance" idea (written above). But it will not recognize it correctly.
Heres my query:
SELECT *, MATCH( field ) AGAINST ('blabla') as relevance FROM `table` WHERE MATCH( field ) AGAINST ('blabla' IN BOOLEAN MODE ) HAVING relevance > 0.2
for example it finds 18 rows. to increase the rows, i checked (manuelly) the relevance. 3 above 18 (all 18.xxx) and the rest about 10.3 or lower. If i increase the "having" to 15, it finds only 2 of 3.
The field i use is the only index-field and ofcourse a fulltext specific field.
it seems that the relevance is taking part of the search results.
i am still looking for another idea but the relevance would cover my needs 100%
*update*
I just updated the mysql version from 4.0.15 to 4.1.8 and it works perfectly...
to set up min_len and stopword_file in windows (win2k or xp):
1. run services.msc to check what .ini your mysql is reading (in my case, "C:\MySQL\bin\mysqld-nt" --defaults-file="C:\MySQL\my.ini" MySQL)
2. change your my.ini like this:
[mysqld]
ft_min_word_len=3
ft_stopword_file="C:\\MySQL\\stop.txt"
3. restart your mysqld at services.msc
4. reindex your table using REPAIR TABLE tbl_name QUICK;
done!
it is legal to use two different arguments with the "double match construction", i.e.
select * , match (artist,album,title) against ('blues in orbit') from musicfiles where match (artist,album,title) against ('ellington');
will FIND all records with 'ellington' as substring of artist, album or title, but will RATE them as the search match'es 'blues in orbit'
You can even .... ORDER BY or GROUP BY MATCH (kunstner,albumtitel,titel) AGAINST ('prelude to a kiss'); ... or against anything else!
Maybe a little off topic here.
Alternatively, instead of doing all full-text search within MySql
database, you can pull data out and create an index on it.
It's a much faster search.
I am using DBSight to search on 1.7million records of 1.2G data, on a P3 450MHz, 256MRam, with sub-second performance.
Taking search out of database also give you capability to customize stemmer and query parser.
When using FULLTEXT queries on large set data sets it's critical to LIMIT your results. Doing some experimentation has made this very clear. Using a data set of 5.5 million rows of forum message posts indexed with a single FULLTEXT index:
select id, match(post) against('foo') as score from messages where match (body) against( 'foo' );
...
155323 rows in set (16 min 7.51 sec)
select id, match(post) against('foo') as score from messages where match (body) against( 'foo' ) limit 100;
...
100 rows in set (1.25 sec)
I ran a number of other tests on various terms known to be in the text with similar results.
These were run in reverse order shown. Make sure you return only the rows you need or you will suffer. For a search engine application returning pages of results keep in mind that nobody is going to ever see page 74! Cap the results to a reasonable maximum trading response time for completeness where possible.
In regards to Dyfed Lloyd Evans comment, I believe that "." will also cause this which is unfortunate because we have product name acronyms we'd like to seach.
After tons of hours today working on this, I HAVE FINALLY MASTERED THE TECHNIQUE OF USING THIS THING!!!!
# This query is what you send to your MySQL to return the results. I put the LIMIT so that you don't overflow.
$query_ret="SELECT ProductID, Title, Description, Price, RetailPrice, MATCH (Title) AGAINST ('$keyword') AS score FROM server.book HAVING score > 0 LIMIT $start, $maxret;";
# This query will COUNT the number of rows it found. (I believe that's correct), I don't believe it counts occurrences, just rows. I saw that if you pull back without a LIMIT above, and count that way, it's 100000x slower. So do your count like this:
$query_count="SELECT count(MATCH(Title) AGAINST('$keyword')) AS score FROM server.book WHERE MATCH (Title) AGAINST ('$keyword') HAVING score > 0;";
Make sure you have your Primary Key setup, your Title and Description as SEPARATE FULLTEXT INDEXES. I spent a few hours boggling over this.
Nathan
There is no way to perdict what maximum relevance rank could be. While working with full text searches one may want to show percentages as the criteria for indicating how close a particular record was to the search query. To achieve this, one way is to select the maximum relevance rank or score and then use it( as a denominator ) with every single record score to get percentage equivalent of score.
For the sake of example, consider we have 6 as maximum rank and 2,3,4.234 are scores for three different records. Now to get percentage we have to do simple maths i.e., we can divide each score by 6(max rank) and then mulitply the result with 100.
(2/6)*100 = ..%
(3/6)*100 = ..%
(4.234/6)*100 = ..%
Hope this helps someone.
Saqib Aziz
The link in James Munro's comment regarding Sergei Golubchik's PHP Conference slides (http://www.phpconference.de/2003/slides/database_track/golubchik_mysql_fulltext_search_2003.pdf)
no longer seems to work.
The following link works, today at least. :) http://www.php-kongress.de/2003/slides/database_track/golubchik_mysql_fulltext_search_2003.pdf.
Tips on how to maximize performance, and minimize return time.
Let me guess: you have a lot of data, and you went ahead and FULLTEXT indexed a column within that large pile of data. To put it in simple words: not cool.
How long does your data take to return from a FULLTEXT search... 5 seconds? More? If you're returning data under 3 seconds, then you're ok, but if you want it to return in 1.3 seconds, here's what I did.
1. Dont fulltext index a column within a huge table. Instead, take a Unique Identifier, and the text you're searching, and copy it to another table.
Use this to export your columns:
SELECT uniquecolumn, mytextsearchcolumn FROM mydatabase.mytable INTO OUTFILE "c:/path/outfile.txt";
That will export your data to a file in Tab Delimited format.
Now, create a new table somewhere, and call it fulltextengine or something, with only 2 columns (less is better), you could add one more if you need to.
Now import your data:
LOAD DATA INFILE "c:/path/outfile.txt" IGNORE INTO TABLE mydatabase.fulltextengine;
Now, create your FULLTEXT index on your search field.
Great. Now you have a separate table for searching. So if you want to try it out, go into MySQL, and try this, or PHPmyAdmin:
SELECT SQL_CALC_FOUND_ROWS uniquecolumn, searchcolumn, MATCH (searchcolumn) AGAINST
('Keyword Goes Here') AS score FROM mydatabase.fulltextengine HAVING score > 0;
Hope this helps!
It may add extra maintenance, but I will give up my soul to increase search speeds by 6x.
Not sure if this is useful for people trying to reduce their search dataset. But what I'm doing is preprocessing the text before I add it to the database. So I add the full text to my FileData column, but I also preprocess the text and put this processed text into a Keywords column. Then I search only the keywords column and never the full text.
This technique obviously (you'll see why) doesn't work for phrase matching but it may speed up the search time by reducing the size of the dataset to search. Here's the algorithm I use.
1. extract the text
2. count each word of 2 or more characters in the text and record the frequency that each occurs
3. from this list of words 2 or more characters long, remove the k (k currently = 500) most common words in the english dictionary
4. sort the list so that the most frequently occurring words appear first
5. take the first n words as the final keywords, where n > 0 and n < the total number of remaining words
If you have performance problems on fulltext search,
Please try Senna, fulltext search engine that can be embedded in MySQL.
http://qwik.jp/senna/
Original MySQL fulltext search is slow for query SELECT COUNT(*) or SELECT * FROM xx LIMIT large_number,x.
These queries are very fast with Senna's 2ind patch.
I'm not convinced that stop-words are of great value. Sure, they might reduce the size of the index and speed-up queries for some kinds of databases. But there are two fundamental flaws with "mandatory" stop-words. Firstly, without knowing in advance the nature of the data, how can any programmer proclaim to know which words should be excluded because they have no value? Secondly, if a user includes any stop word in a search query, then by definition that word's value can not be zero. If the word's value was zero, then why would the user use it in a search query?
If you need to disable stop-words without re-compiling, consider appending a character to the end of each word in your text before inserting your text into the database. I used "q". I also right-padded all words shorter than 4 characters to a length of 4 characters by appending "q"s. And I appended a " " to the end of my text string before inserting into the database.
After retrieving my text from the database, I did a global replace on my text. I changed "qqq" to "", "qq" to "", and "q " to " " - in that order - to restore the text to it's original form. That was not the best solution but it worked. Luckily for me, my text was simply a space delimited list of words without and punctuation. Otherwise, my "q" encoding and decoding would have been more difficult.
Combining MATCH with MAX, GROUP BY and ORDER BY can be of a lot of help when retrieving results in the correct order regarding relevance.
For example, using this I could solve a problem with one table called 'Professional', which had a many-to-many reference to another table 'Profession'. Although I was using DISTINCT I was getting duplicate results because of different relevance values that MATCH was giving to the different entrances in 'Professions'.
Here I am copying the final query that worked OK (it's a bit simplified):
select distinct p.idProfessional, MAX(MATCH (pssion.name, pssion.description) AGAINST ('string to search')) as professionRanking FROM professional p, professional_profession pp, profession pssion WHERE pp.Professional_idProfessional = p.idProfessional AND pp.Profession_idProfession = pssion.idProfession and ( MATCH (pssion.name, pssion.description) AGAINST ('string to search') GROUP BY p.idProfessional ORDER BY professionRanking DESC
Add your own comment.