Ecommerce Site Search Tip



#31 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 26 May 2009 - 12:07 PM

QUOTE
Just Google for search appliance and check them out. This one will crawl the site (which we want for many reasons) or it can be pointed at a DB. The cost is directly associated with the number of pages you want crawled. The more pages, the more money.


Really? Wow, does it crawl and index, or do an 'on-the-fly' crawl of the pages?

I could write a script which did a directory listing, scanned each page it found in the directory looking for a text match and then listed the results with hyperlinks to the pages it found, probably all within an hour or so.

I have implemented simple DB queries against any given field in the DB for 'broad match' without the need for wildcards on my dance-music site. Try it; it's simple yet powerful.

I guess you're the type of salesman that can sell sand to the Arabs and ice to the Eskimos!

Nice job.

#32 BBCoach (Moderator, 402 posts)

Posted 26 May 2009 - 12:28 PM

QUOTE
Really? Wow, does it crawl and index, or do an 'on-the-fly' crawl of the pages?
It does both, crawling and indexing on either a schedule or continuously.

QUOTE
I could write a script which did a directory listing, scanned each page it found in the directory looking for a text match and then listed the results with hyperlinks to the pages it found, probably all within an hour or so.
Yes you can, but then there's the work of analyzing search phrases found/not found and click-throughs (if any). Writing a spider is almost trivial in .NET, but the other stuff is far harder and more time-consuming. That's why I recommended the purchase of a search appliance, which also takes that load off of the web servers and DBs. Very important with a highly trafficked site.
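The search-phrase logging BBCoach mentions is easy to bolt onto any hand-rolled search. A minimal Perl sketch, assuming a `search_log` table and the connection details shown (both hypothetical):

CODE
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical connection details; adjust for your own server.
my $dbh = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                       { RaiseError => 1 });

# Record each search phrase and how many results it returned, so you
# can later report on phrases that found nothing (gaps in the catalogue).
sub log_search {
    my ($phrase, $result_count) = @_;
    $dbh->do(
        'INSERT INTO search_log (phrase, results, searched_at)
         VALUES (?, ?, NOW())',
        undef, $phrase, $result_count,
    );
}

# Example: after running a query that returned 0 rows.
log_search('blue widget', 0);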

#33 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 27 May 2009 - 03:50 AM

Ah, so it runs as a website plugin where really it's embedded or iframed into the webpage and actually runs on a separate server.

Though I'm interested to see how long it would take to scan every HTML file on a site of around 50 pages, extracting the text via regex and doing a string match.

Hmm, I see a self-appointed task coming on here; thanks for the insight.

Not sure how to index stuff. I once knew a coder who wrote an indexing algorithm for a book shop in C with an Informix backend, and wow, was it fast. That's a bit out of my league, but I shall have a play with an on-the-fly search and see what the overhead is.

As you say, it's also relative to visitor volumes, server load and search volumes. I've never been in a position to have to worry about visitor loads on the server, unlike you experts!

#34 Gerry White (Active Members, 48 posts, UK)

Posted 27 May 2009 - 04:25 AM

QUOTE(1dmf @ May 27 2009, 09:50 AM)
Ah, so it runs as a website plugin where really it's embedded or iframed into the webpage and actually runs on a separate server. [...]


There is WAY more to an on-site search engine than that:
1. Stop words (words for the search engine to ignore; typically 'what', 'where', 'how')
2. Stemming
3. Synonyms (and of course spelling correction)
4. Fuzzy logic
5. Prioritisation (ranking)
6. Which parts of a page to index, and which not to (should it index related products, for example?)
7. Metadata to include in the search results ...

Yes, I have written basic search algorithms in the past, but I would NEVER recommend one of my scripts for a higher-level ecommerce site, and I would also be nervous about using Google's "black box" solution...

I hate to say it, but search is a vital part of a larger ecommerce site. If you have 50 product pages then I would guess you don't need an advanced search, but say you have 50 different bags: wouldn't you want to be able to search for "laptop" or "airline"?

Search also doesn't have to be quite so "input in, results out"; results can be filtered etc.

I guess the idea of a good site search is to make it easier for the person who knows what they want to buy to buy the product they want, without having to understand 'your' navigation...

(And I still think a GOOD AJAX suggest is a great idea. I wanted to build a site for a client exclusively around that, but with SEO-friendly pages in the background.)

Gerry

#35 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 27 May 2009 - 05:15 AM

QUOTE
There is WAY more to an on-site search engine than that: [the seven-point list above]
You're talking about indexing and, of course, error tolerance, auto-correct etc. for an e-commerce site selling thousands of products; I'm not.

For my purpose that is simply overkill: a waste of time and money, and of no use to man nor beast.

I'm not trying to write my own Google here!

You see, many here have sites which sell stuff they don't manufacture, that is not an invention or product created by them; they merely act as a middleman or affiliate or agent etc.

That's something I personally would never touch with a barge pole; I'm not a fan of middlemen or retail! But I think I can come up with something which will serve my purpose and the sites I am thinking of integrating it into.

Not everyone is interested in becoming a millionaire by selling stuff they are not interested in, or SEOing a company site that sells something they are not interested in.

If that was the case I'd have a "made for AdSense" site, and I don't, nor will I, and I ditched my affiliate shopping site for the same reasons.

But if you have some insight into any stemming code, synonyms or fuzzy logic, please do share; any help improving code I write is always appreciated.

Personally I like searches that match what I write. If I make a typo I don't want it to try and second-guess what I meant to type; I want it to tell me it found no matches and to please check my search term. I was always taught with computers GIGO: garbage in, garbage out!

But I'm the type of guy who hates predictive text for SMS; it doesn't make any sense to me, I can't get a single word out of that 'feature'. But YMMV!



#36 BBCoach (Moderator, 402 posts)

Posted 27 May 2009 - 09:40 AM

QUOTE
There is WAY more to an on-site search engine than that: [the seven-point list above]


Gerry, you're getting WAY too complicated here, and you'll find it cheaper and faster to buy a search appliance than to code what you're describing. I know: I went down that path, and doing it "right" is easier said than done. You can find an entry-level one for $500 or less on E that'll index 50,000 pages.

I normally don't give specific recommendations on what to use or not use without remuneration (I'm a capitalist pig and proud of it), but if you'll search at G for [search appliance] and evaluate the first one returned, you'll knock off the points you listed. Then, since one can assign a custom user-agent to the appliance, you can detect that it's hitting your site and serve up only what you want it to index. For example, I don't serve graphic images or the site template; I display only text. It's basically spider-baiting in reverse: I bait my in-house SE. And lastly, metadata is not necessary when you're serving up the exact title, pricing and description that you want indexed.
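The user-agent trick BBCoach describes is simple to sketch. A minimal Perl CGI fragment, assuming the appliance is configured to send a custom agent string (the name 'MySiteBot' and the templating routine are made up for illustration):

CODE
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

my $q = CGI->new;

# The appliance's custom user-agent string is set in its admin console;
# 'MySiteBot' here is purely illustrative.
my $is_indexer = ($ENV{HTTP_USER_AGENT} || '') =~ /MySiteBot/;

print $q->header('text/html');

if ($is_indexer) {
    # Bait the in-house spider: bare text only, no template or images.
    print "<html><body><h1>Widget X</h1>",
          "<p>Price: 9.99. Sturdy blue widget.</p></body></html>";
}
else {
    # Normal visitors get the full page template.
    print render_full_page();   # hypothetical site templating routine
}

sub render_full_page { return "<html>...full template...</html>" }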

#37 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 27 May 2009 - 09:55 AM

QUOTE
I normally don't give specific recommendations on what to use or not use without remuneration (I'm a capitalist pig and proud of it)
No way, who'd have thought it! Really?

I was actually going to do a simple Perl script that grabbed the home directory for all HTML files and parsed them through a regex to strip out tags/code, so only text is left.

Then do a simple text match. I might allow the (+) operator to say "must have this + (and) this", which I've done on some of my in-house search facilities.

Do a string match (greedy, case-insensitive), flag page names which have a match, and supply a simple hyperlinked list of pages matching the search, with a text snippet showing the context of the match.

All this can be done 'on-the-fly', so it will always take into account page updates, changes, page removals etc.

I have a good site to test with, at around 50 pages. It's not the bells-and-whistles approach Gerry was implying, but a pretty powerful search against web pages. OK, I know G! site search basically does that once the page is indexed, probably including a lot of what Gerry talks about too...

But it's not as much fun as hand-rolling your own.

And as mine will do it 'on-the-fly', it will always be up to date, unlike the G! index.
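A minimal Perl sketch of that on-the-fly scan. The document root, the tag-stripping regexes and the snippet length are all assumptions, and a regex strip is only a rough stand-in for real HTML parsing:

CODE
#!/usr/bin/perl
use strict;
use warnings;

my $docroot = '/var/www/html';   # hypothetical document root
my $term    = shift @ARGV or die "Usage: $0 <search term>\n";

opendir my $dh, $docroot or die "Cannot open $docroot: $!";
my @pages = grep { /\.html?$/i } readdir $dh;
closedir $dh;

for my $page (@pages) {
    open my $fh, '<', "$docroot/$page" or next;
    local $/;                               # slurp the whole file
    my $text = <$fh>;
    close $fh;

    $text =~ s/<script.*?<\/script>//sig;   # drop script blocks
    $text =~ s/<[^>]+>/ /sg;                # crude tag strip
    $text =~ s/\s+/ /g;                     # collapse whitespace

    # Case-insensitive literal match with a short context snippet.
    if ($text =~ /(.{0,40}\Q$term\E.{0,40})/i) {
        print qq{<a href="/$page">$page</a> ... $1 ...\n};
    }
}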


#38 Randy (Moderator, 17,540 posts)

Posted 27 May 2009 - 10:45 AM

1dmf: If you're going to build one yourself, the spider is the easier part. There are even several open-source spiders out there.

The rest is a bit more difficult, because you have to decide how to weight things and, more importantly, you have to construct your database in a way that is fast and well optimized. There are different ways to do this. I can tell you what I did, after considerable thought, for some reasonably sized sites I run.

What I did was completely forget about the spidering side of the equation. My search was for the items the store contained, so I didn't care if every page got indexed by my site search. Though I could add any of those I wanted.

Mine is built in PHP. My db is MySQL; more specifically, the search database is a MySQL db that utilizes FullText indexes. And my MySQL config (/etc/my.cnf on my Linux system) is tweaked a bit to allow three-letter fulltext searches, as opposed to the default four-letter minimum. I got lucky (or good) in that I have my "products" db set up so that I can extract all of the info I need from those fields, including enough of a hint that I have the URL for each individual product page to drop into my site search db.

So I have a normal product database. And I have a separate site search db that is fulltext and uses indexes, which gets updated by a little cron job that fires daily to run another PHP script pulling the info from my products db into the search db, or I can run it manually. And I have a third db that does nothing but collect the searches made using the site search, so I can analyze them later. And it all works like a charm for my little sites.
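A rough sketch of the pieces Randy describes. Table and column names are assumptions based on his description, and his actual code is PHP; this uses Perl DBI for consistency with the code elsewhere in the thread. The three-letter tweak he mentions is `ft_min_word_len=3` in /etc/my.cnf (the index must be rebuilt after changing it).

CODE
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                       { RaiseError => 1 });

# One-off: a MyISAM search table with a FULLTEXT index
# (FULLTEXT requires MyISAM in MySQL of this era).
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS product_search (
        id     INT PRIMARY KEY,
        url    VARCHAR(255),
        title  VARCHAR(255),
        body   TEXT,
        FULLTEXT (title, body)
    ) ENGINE=MyISAM
});

# Cron job: refresh the search table from the products table.
$dbh->do('REPLACE INTO product_search (id, url, title, body)
          SELECT id, url, title, CONCAT(title, " ", description)
          FROM products');

# Query side: natural-language fulltext match, best matches first.
my $rows = $dbh->selectall_arrayref(
    'SELECT url, title, MATCH(title, body) AGAINST (?) AS score
     FROM product_search
     WHERE MATCH(title, body) AGAINST (?)
     ORDER BY score DESC LIMIT 20',
    { Slice => {} }, 'laptop bag', 'laptop bag',
);

print "$_->{score}\t$_->{title}\t$_->{url}\n" for @$rows;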

#39 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 27 May 2009 - 11:25 AM

Hmm, when you say spider, you actually mean scanning the doc, deciding what's important and what isn't, and then indexing it somehow for searching against later.

I'm looking at a real-time, on-the-fly page scan and match producing results.

How do you decide which are the important words and which aren't?

How do you build an index? Create a record with a key of the search word, and then list all URLs of pages which contain the word?

It all seems like massive overhead and overkill for a simple string text-match search.

If you were searching the product DB, wouldn't you write a search which does the SQL query on the fly as well?


CODE
SELECT * FROM [products] WHERE [title] LIKE '%search%' OR [description] LIKE '%search%' OR [item] LIKE '%search%'
etc. etc.

You could wrap it in a loop to build up the SELECT via the (+) operator as discussed; this is what I currently use:

CODE
sub get_rec {

# $_[0] = Table
# $_[1] = Display columns
# $_[2] = Search columns
# $_[3] = Order by
# $_[4] = Search words

# Initialise WHERE variable
my $where = "";

# Split search into separate keywords on the (+) operator
my @search = split(/\+/, $_[4]);

# Split out columns to search
my @cols = split(/,/, $_[2]);

# Check for search words and build the WHERE clause
# (note: terms are interpolated directly into the SQL, so quotes
# in user input should be escaped before calling this)
if($_[4]){
  my $cnt1 = 0;
  foreach my $sr (@search){
         my $cnt2 = 0;
         $where .= "( ";
           for(@cols){
            $where .= "$_ LIKE '%$sr%'";
            if($cnt2 < $#cols){$where .= " OR ";}
            $cnt2++;
           }
           $where .= " )";
    if($cnt1 < $#search){$where .= " AND ";}
    $cnt1++;
  }
}
else{$where = " 1=1 ";}

# Get matching records
my @rs = &getSQL("$_[0]","$_[1]","$where","$_[3]");

# Return SQL record set
@rs;

}


And I could do a split on spaces and use the OR operator as well.

For this requirement it needed the AND operator, so I use the + sign.

OK, perhaps some training or a good search FAQ was required, and it could use more work to be a bit more flexible, but not one person has ever rung and said they couldn't find what they needed via the search facilities I have provided in my web apps.

I like to keep things as simple as possible; creating an additional DB for holding indexing data is definitely a sledgehammer-to-crack-a-walnut syndrome for my purpose, me thinks.
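For reference, the OR variant 1dmf mentions is a small change to the clause builder: split on whitespace and join the per-term groups with OR instead of AND. A minimal sketch, assuming the same column-list and search-string inputs get_rec() receives:

CODE
# Build a WHERE clause that matches ANY keyword (OR) rather than all (AND).
sub build_or_where {
    my ($terms, @cols) = @_;
    my @groups;
    for my $word (split /\s+/, $terms) {
        push @groups,
            '( ' . join(' OR ', map { "$_ LIKE '%$word%'" } @cols) . ' )';
    }
    return @groups ? join(' OR ', @groups) : ' 1=1 ';
}

# Example: build_or_where('house trance', 'title', 'artist') returns
# "( title LIKE '%house%' OR artist LIKE '%house%' ) OR
#  ( title LIKE '%trance%' OR artist LIKE '%trance%' )"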

#40 Randy (Moderator, 17,540 posts)

Posted 27 May 2009 - 02:50 PM

You'd kill your server if you did it that way with a typical product db configuration. That's why I maintain a separate db just for search, and have it configured to work the way I want it to work.

I had actually intended to make a blog post or three about the solution I came up with. But I never got around to it, partly because it's a one-off solution that I've tweaked to use on my various sites, and partly because there are some changes I had to make to my MySQL configuration that many wouldn't be able to make on a shared hosting platform.

In other words, I got lazy.

#41 BBCoach (Moderator, 402 posts)

Posted 27 May 2009 - 04:02 PM

QUOTE
1dmf: It all seems like massive overhead and overkill for a simple string text-match search.
It is! I spent more than a year researching (while trying to code) internal SE options. I finally pulled out enough hairs and figured, "What the hey! Why try to reinvent the wheel?"

QUOTE
If you were searching the product DB, wouldn't you write a search which does the SQL query on the fly as well?
Yes, but when a site has many hundreds of categories and thousands of pages, you want the internal bot to crawl as a backup check on site functionality.

QUOTE
Hmm, when you say spider, you actually mean scanning the doc, deciding what's important and what isn't, and then indexing it somehow for searching against later.
Yes and no. Yes, it's a spider in the traditional sense; and no, you don't care what's important and what isn't, because you don't know what a searcher may be thinking when querying your data set. Let them find what they want to find.

QUOTE
said they couldn't find what they needed via the search facilities I have provided in my web apps.
Actually, most simply go somewhere else to look.

QUOTE
I like to keep things as simple as possible; creating an additional DB for holding indexing data is definitely a sledgehammer-to-crack-a-walnut syndrome for my purpose, me thinks.
Me too, and it might be fine for only 50 pages, but where's the room (plan) for growth?

#42 1dmf (Active Members, 2,167 posts, Worthing - England)

Posted 27 May 2009 - 06:22 PM

QUOTE
You'd kill your server if you did it that way with a typical product db configuration.
This is partly what I'm trying to understand, and why. I thought the point of SQL was that it is a complicated and evolved DB engine, capable of holding millions of records and performing a simple query against the columns/rows in a split second, and which also has its own internal indexing and matching algos.

I know a web server has a limited capacity, and that's when you have to get into load balancing, but what is the average capable load of a shared hosting server? Or even a dedicated one, for that matter?

QUOTE
Yes, but when a site has many hundreds of categories and thousands of pages, you want the internal bot to crawl as a backup check on site functionality.
Well, I've got a long way to go till I have a site that large! I guess I'm lucky in that respect, at least for the time being.

QUOTE
Actually, most simply go somewhere else to look.
Well, no, not in this circumstance. These are in-house systems, so members have no choice but to use them, and I have to support all in-house apps as well as write them and tweak to order! So when a member couldn't find what they wanted, I had to make the amendments to satisfy their requirement, which was simply a matter of adding a field to the search criteria, as it wasn't included in the original spec.

QUOTE
Me too, and it might be fine for only 50 pages, but where's the room (plan) for growth?
Perhaps, but not by much or in the foreseeable future; article publication has been pulled! But I still might tweak the article search so it scans the article pages instead of the title/description in the RSS XML feed, which it does at present.

All articles are indexed, so I could use the Google web search function, but I like to code my own; it keeps things self-branded. Not that G! needs any additional advertising, though perhaps people seeing G! search on your site encourages them to use it, as it's a brand they recognise.

[edit] Incidentally, what with all this search talk and learning more about Perl variable interpolation, I added string-match highlighting to my dance-music.org album search. So thanks for the encouragement and input; it's already been put to good use.

Edited by 1dmf, 27 May 2009 - 06:34 PM.
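The string-match highlighting 1dmf mentions comes down to a single substitution. A minimal sketch, assuming the results are HTML and $term holds the user's search string:

CODE
# Wrap each case-insensitive occurrence of the search term in <strong> tags.
# \Q...\E quotes regex metacharacters, so user input is treated literally.
my $term = 'trance';
my $text = 'Uplifting Trance, hard trance and more';
$text =~ s{(\Q$term\E)}{<strong>$1</strong>}gi;
print $text, "\n";
# => Uplifting <strong>Trance</strong>, hard <strong>trance</strong> and more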


#43 Randy (Moderator, 17,540 posts)

Posted 28 May 2009 - 09:41 AM

The issue of server load comes down to how big the db is and how much memory and CPU capacity the server has available at peak periods.

Parsing through a 50-row db takes less of both than parsing through a 500,000-row db, obviously.

If you construct your db correctly, indexes and FullText will solve most of those issues, even in a single-server environment, at least until you start talking about millions of rows in a db. Look into database optimization and you'll save yourself a lot of headaches.
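As a quick way to check whether a query is actually using an index, MySQL's EXPLAIN is the usual first stop. A minimal sketch, with table and column names hypothetical:

CODE
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=shop', 'user', 'pass',
                       { RaiseError => 1 });

# Ask MySQL how it plans to execute the query; a 'key' of NULL in the
# output means no index is used and a full table scan will occur.
my $plan = $dbh->selectall_arrayref(
    q{EXPLAIN SELECT url FROM product_search
      WHERE MATCH(title, body) AGAINST ('laptop bag')},
    { Slice => {} },
);

printf "table=%s key=%s rows=%s\n",
    $_->{table}, (defined $_->{key} ? $_->{key} : 'NULL'), $_->{rows}
    for @$plan;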



