You are currently browsing the Bogle’s Blog weblog archives.

Google Adwords links used to spread Computer Viruses

From the Washington Post:

Virus writers have been gaming Google’s “sponsored links” — the paid ads shown alongside search engine results.

According to a report at Exploit Prevention Labs, while the top sponsored links that showed up earlier this week when users searched for “BBB,” “BBBonline” or “Cars.com” appeared to direct visitors to those sites, they initially would route people who clicked on the ads through an intermediate site. The intermediate site attempted to exploit a vulnerability in Microsoft Windows to silently install software designed to steal passwords and other sensitive information from infected PCs. The attackers exploited a flaw in Microsoft’s Internet Explorer Web browser, a problem that the company issued a patch to fix last June.

As Exploit Labs’s Roger Thompson notes in his blog, the bad guys behind the attack appeared to capitalize on an odd feature of Google’s sponsored links. Normally, when a viewer hovers over a hyperlink, the name of the site that the computer user is about to access appears in the bottom left corner of the browser window. But hovering over Google’s sponsored links shows nothing in that area. That blank space potentially gives bad guys another way to hide where visitors will be taken first.

According to Thompson, Google has taken down the offending sponsored links. In fact, searching for “betterbusinessbureau” in Google no longer turns up any sponsored links at the moment.

Nasty stuff.  

This is important to Google and users because it undermines trust in clicking on Adwords links.  (The problem isn’t unique to Google, of course; a similar exploit used banner ads earlier.)

Your odds of stumbling across a malware site at random or in organic Google search results are low, since the bad guys typically don’t have a good page rank.  But with sponsored links any bad guy can be at the top of the ranking, and a quick redirect can conceal the fact that anything happened.

Advertising networks and search engines need to check linked pages for viruses and phishing, just like software download sites do.
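As a rough illustration of the simplest check an ad network could run, here is a minimal Python sketch that looks at the chain of URLs an ad click passes through and flags any host that is neither the advertised site nor a known ad-tracking host. The function names, the example hosts, and the idea of a trusted-host list are all assumptions for illustration; this only compares host names and does not scan pages for exploits.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def same_site(host, expected):
    """True if host is expected itself or one of its subdomains."""
    return host == expected or host.endswith("." + expected)


def suspicious_hops(redirect_chain, advertised_host, trusted_hosts=()):
    """List hosts in the chain that are neither the advertised site
    nor on the (hypothetical) list of trusted ad-tracking hosts."""
    advertised = advertised_host.lower()
    flagged = []
    for url in redirect_chain:
        host = host_of(url)
        if same_site(host, advertised):
            continue
        if any(same_site(host, t.lower()) for t in trusted_hosts):
            continue
        if host not in flagged:
            flagged.append(host)
    return flagged
```

Given a chain such as an ad-click URL, an unknown intermediate site, and then the real destination, the function returns only the unknown intermediate host, which is the hop a reviewer would want to inspect.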

I’ve been thinking that if a search engine wanted to compete with Google, security, privacy, and openness might be a good way to do it.

What if you could use a search engine that offered decent relevance while also guaranteeing:

  • Security:  automated probing of sponsored and free links to ensure that the destination page doesn’t attempt to install spyware or viruses.
  • Privacy: no logging of search terms or cookie tracking.  No attempt to correlate activity across sites.
  • Openness:  an open REST API for accessing the search results under a clearly defined and affordable pricing model.

Syslog-NG and Metrics Analysis

I’m working with Kyle Larson on a project to help Jobster capture and analyze large volumes of data about categorized job impressions and clickthroughs, and wanted to share a useful building block we’ve encountered along the way.   

Dave Nash from our ops team introduced us to Syslog-ng, a drop-in replacement for the standard Linux syslog daemon. From the Freshmeat project page:

syslog-ng, as the name shows, is a syslogd replacement, but with new functionality for the new generation. The original syslogd allows messages only to be sorted based on priority/facility pairs; syslog-ng adds the possibility to filter based on message contents using regular expressions. The new configuration scheme is intuitive and powerful. Forwarding logs over TCP and remembering all forwarding hops makes it ideal for firewalled environments.

The machine creating the log entry only needs to send it over the network, where it’s buffered and eventually logged on a centralized syslog server.  The buffering means the web server doesn’t have to wait on a disk or database. Syslog-ng also supports load balancing and forwarding if the log traffic exceeds the capacity of a single machine.
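To make the sending side concrete, here is a minimal Python sketch of an application handing an event to a remote syslog daemon with the standard library's SysLogHandler. The key=value line format, the logger name, and the host loghost.example are assumptions for illustration, not the format or setup we actually use.

```python
import logging
import logging.handlers


def format_event(kind, job_id, category):
    """Render one metrics event as a single key=value log line
    (a hypothetical format chosen for this sketch)."""
    return f"metric kind={kind} job_id={job_id} category={category}"


def build_metrics_logger(host="loghost.example", port=514):
    """Return a logger that ships events to a remote syslog daemon.

    SysLogHandler sends over UDP by default, so the caller never
    waits on the log server's disk or database.
    """
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("metrics")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

A request handler would then call something like build_metrics_logger().info(format_event("impression", 1234, "engineering")) and move on.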

Syslog-ng configuration options allow the log entries to be directed to a variety of destinations (files, named pipes, etc.) based on a fairly rich pattern matching system. One such destination for log entries is MySQL by way of a daemon called metricbot, written by Andrew. Metricbot listens on a named pipe for log entries sent by syslog-ng and writes them to a structured MySQL database.  Assuming the database can keep up with the insert rate of events, this gives us near real-time import of log entries into the database, without slowing down the rest of the system when the database can’t keep up.
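The following is not metricbot itself, just a minimal Python sketch of the same idea: read log lines, parse them into structured rows, and insert them in batches. The line format, the table layout, and the function names are assumptions, and sqlite3 stands in for MySQL so the example is self-contained.

```python
import re
import sqlite3

# Hypothetical line format: "... kind=click job_id=42 category=sales"
EVENT_RE = re.compile(
    r"kind=(?P<kind>\w+) job_id=(?P<job_id>\d+) category=(?P<category>\w+)"
)


def parse_event(line):
    """Return (kind, job_id, category) or None if the line doesn't match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return (m.group("kind"), int(m.group("job_id")), m.group("category"))


def import_lines(lines, conn, batch_size=100):
    """Insert parsed events in batches; return the number of rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(kind TEXT, job_id INTEGER, category TEXT)"
    )
    batch, inserted = [], 0
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue  # skip lines that are not metrics events
        batch.append(event)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)
            inserted += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)
        inserted += len(batch)
    conn.commit()
    return inserted
```

In a real deployment the lines argument would be the open named pipe that syslog-ng writes to, and the connection would point at the production database; any iterable of strings works for testing.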

For our purposes, we don’t mind losing log entries in a crash. Syslog-ng allows you to tune how frequently log entries are flushed to disk but doesn’t provide any absolute guarantees that entries will be preserved.

New Office Temps compares the job boards

New Office has a comparison of applicant volume and cost per applicant across different job boards.  The key data is summarized in the table below:

Site            Applicants per ad   Total jobs posted   Cost per applicant
Careerbuilder   82                  310                 $5
Craigslist      45                  341                 $0
Monster         4                   150                 $67
Hotjobs         2                   70                  $137

The reported data and analysis overwhelmingly favor Careerbuilder over Monster and Hotjobs. In terms of cost per prospect, Craigslist was the absolute leader, although the analysis soft-pedals this fact.  Craigslist has a broken link to the press release on their job board comparison page, misattributing the information to Yahoo Finance.
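To put numbers on the size of the gap, here is a small Python sketch that computes ratios directly from the figures reported in the press release. The helper names are mine, and the ratios say nothing about whether the study's methodology is sound.

```python
# Figures as reported by New Office:
# site -> (applicants per ad, total jobs posted, cost per applicant in dollars)
REPORTED = {
    "Careerbuilder": (82, 310, 5),
    "Craigslist": (45, 341, 0),
    "Monster": (4, 150, 67),
    "Hotjobs": (2, 70, 137),
}


def applicant_ratio(site_a, site_b):
    """How many times more applicants per ad site_a drew than site_b."""
    return REPORTED[site_a][0] / REPORTED[site_b][0]


def cost_ratio(site_a, site_b):
    """How many times more site_a cost per applicant than site_b."""
    return REPORTED[site_a][2] / REPORTED[site_b][2]
```

For example, applicant_ratio("Careerbuilder", "Monster") gives 20.5 and cost_ratio("Hotjobs", "Careerbuilder") gives 27.4. No cost ratio can be taken against Craigslist, since its reported cost per applicant is $0.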

I can’t vouch for the methodology or lack of bias of the study, which was reported in a press release.

Both Careerbuilder and New Office are located in Chicago. 

Possible sources of bias include the Chicago locations of the jobs posted in the study and the fact they were all temp positions.

In November 2006, New Office released a very similar press release with essentially the same conclusions; Careerbuilder again enjoyed a 10 to 1 advantage in terms of applicants per ad.

Update: Just to give you an idea of how hard it is to find hard objective comparisons, Net-Temps has a similar comparison, except that in their study Yahoo Hotjobs comes out on top and Net-Temps is number 2.

Resolution for IE6 and Flash XML issues

These days, web search is becoming an indispensable debugging tool.

A week or so ago, we finally fixed the bug that prevented the Blog Buddy widget from working correctly in IE6.  

This was a challenging issue to debug because the widget worked perfectly on my personal web server but failed on the main jobster.com site.  If not for the web, we might never have found the answer; fortunately a search revealed this post on the MediaCatalyst Blog:

The issue is this: the Flash player in IE6 cannot correctly load (xml) files from web servers that use HTTP compression and no caching. In other words, it does not correctly load any file that returns the following combination of the HTTP response headers:

Cache-Control: no-cache
Content-Encoding: deflate

No other browser’s Flash player has this bug, this only occurs in IE6. The problem could only be ‘fixed’ by either disabling HTTP compression, or by removing the no-cache header for this file and user-agent IE6.

I didn’t see the issue on my machine because it wasn’t set up to compress content, whereas our production Apache server was. 

The simplest workaround in our case turned out to be to use the “no-store” Cache-Control header rather than “no-cache”.  (Simply removing the no-cache header will not force the browser to refetch the XML feed on each access, which is often not the desired behavior.)
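As a sketch of the workaround, the Python below detects the problematic header combination in a response and swaps no-cache for no-store. The function names are hypothetical, headers are assumed to be a plain dict with exactly these key spellings, and treating gzip the same as deflate is my assumption based on the post's reference to HTTP compression in general.

```python
def split_directives(value):
    """Split a Cache-Control value into trimmed, non-empty directives."""
    return [d.strip() for d in value.split(",") if d.strip()]


def has_ie6_flash_hazard(headers):
    """True if the response combines no-cache with HTTP compression."""
    directives = [d.lower() for d in split_directives(headers.get("Cache-Control", ""))]
    encoding = headers.get("Content-Encoding", "").strip().lower()
    return "no-cache" in directives and encoding in ("gzip", "deflate")


def apply_no_store_workaround(headers):
    """Return a copy of headers with no-cache replaced by no-store
    when the hazardous combination is present."""
    fixed = dict(headers)
    if not has_ie6_flash_hazard(headers):
        return fixed
    result = []
    for d in split_directives(headers["Cache-Control"]):
        replacement = "no-store" if d.lower() == "no-cache" else d
        if replacement not in result:  # avoid a duplicate no-store
            result.append(replacement)
    fixed["Cache-Control"] = ", ".join(result)
    return fixed
```

Responses that are uncompressed, or that never sent no-cache, pass through unchanged, so the same function can run on every response for the XML feed.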

Rootkit

Like Rich Tong, my fully patched home Windows box seems to have been infected by a particularly nasty rootkit that is invisible to standard anti-spyware and anti-virus software; the main symptoms are randomly named six-letter EXEs that generate browser popups.  RootkitRevealer from Sysinternals shows evidence of the rootkit but doesn’t identify it precisely and can’t clean it up on its own.  (Will System Restore help me here?)

The most dismaying thing is that some of the popups generated are for an anti-virus product called WinAntiVirusPRO, which is perhaps knowingly profiting off of the infected machines. I hope everyone avoids that software.

Nationwide Blackberry outage caused by non-critical system update

From BlackberryToday:

The Wall Street Journal is reporting that the BlackBerry outage we all felt earlier this week was caused by a non-critical system upgrade and routine that had not been tested enough. This caused a chain reaction that stopped e-mails from flowing to many businesses, governments, and individuals, according to The New York Times.

Research In Motion (RIM), of course, did not think the software would cause a problem like this, as it was intended to improve the system. The Times says RIM reacted quickly and tried to switch service to its backup system, which had been tested and worked in the past. Unfortunately, this time the backup system did not work, delaying service even more.

Is Google reading your mail to improve ranking?

It’s one thing to rank pages based on publicly posted links, another thing to use supposedly private links in emails or subscriptions to improve ranking quality.

This article on Problogger describes a recent Google blogsearch patent filing:

Is Google Reading Your Mail?!

Read this carefully:

[0044] References to the blog document by other sources may be a positive indication of the quality of the blog document. For example, content of emails or chat transcripts can contain URLs of blog documents. Email or chat discussions that include references to the blog document is a positive indicator of the quality of the blog document.

Are you thinking what I’m thinking?! Google has a massively popular hosted email service - GMail. They also have Google Talk, a chat service. You probably knew that. But did you know Google has intentions of crawling the content of your GMail emails and Google Talk chat sessions?! Now, I don’t know if they actually do that or not, and I haven’t gone hunting thru their terms of service seeking clarity, but their stated aim is clear: to find URLs in two key forms of personal online communications (email and chats), and to use these discoveries to further rank blogs and blog posts.

I have to say it makes perfect sense. Why? Because Google is looking to build a more and more accurate profile of your and my blog. And to do this Google wants to see corroborating evidence of popularity across as many different “media” as possible: web pages, blog posts, search results click patterns, blogrolls, social bookmarking services, and now email and chat session content. Wow… that’s called being thorough.

Using links in email to improve the quality of blog searching might seem harmless, but this is a real slippery slope.    

As automated analysis of natural language improves, and more and more services come under the Google umbrella, the possibilities for conflict of interest and abuse grow more numerous.  

When does mining private data for public use go too far?  Could Google mine emails for private stock tips and use those to create a public portal of investment tips, for example?

Running Ruby in the Browser

Running Ruby in the Browser is a proof of concept demo of embedding Ruby in a web browser using JRuby:

This is an experiment to run Ruby in the browser. It uses an embedded JRuby applet, and the page communicates to that puppy via LiveConnect. This simple example shows running code via script type="text/jruby" (which is run automatically once, which you see in the output below), and from a form itself. From here we will allow smart bidirectional talk between the DOM and Ruby, and offer up a nice abstraction layer to hopefully enable something like:

<script type="text/ruby">
  document.ready do |dom|
    dom["table tr"] << "<td>test</td>"
  end
</script>

Alexa blames stupid site owners for stat inconsistencies

Don’t get me wrong, Alexa can be a valuable service, but their data doesn’t match up with our own carefully analyzed logs, Mediametrics, and other services like compete.com.

Rather than simply acknowledging the possibility of sampling errors, a blog post from Alexa titled Alexa Data vs. Your Raw Logs shifts all blame to site owners, saying that they don’t know how to analyze their own logs:

We occasionally hear from users asking why Alexa’s traffic data for a site doesn’t match the data from their site’s logs…

Few individuals are sophisticated enough to read their logs and understand what they mean… You can’t reliably detect fraud or crawlers or many of the other factors mentioned above that have a drastic impact on your reported visitor number.

Anyone, anywhere, can download the Alexa toolbar and contribute to the stats.  There’s not a single mention of possible inaccuracies caused by the fact that Alexa’s “panel” is self-selected, nor any details behind Alexa’s seemingly magical claim to be able to distinguish fraudulent users of the toolbar from legitimate ones.

Rather than increasing trust in a service, self-serving propaganda like this decreases my level of trust. 

(For the record, some of the other sites like compete seem to have much more representative stats than Alexa these days.)

Career Advice Portal in Jobster Labs

Check out the new and improved career advice portal in Jobster labs! 

The portal lets you search for career advice from a comprehensive but carefully chosen set of sites, reviewed by a career professional for quality.

The results are higher quality than standard Google results, because many of the top hits on Google are from sites that have great SEO but low quality content. Building on Google Coop, we can filter out the noise and offer only high quality, categorized results. 

Props to Robby for working his design and usability magic on the portal.