<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Wavelengths: The Lightcrest Blog</title>
	<atom:link href="http://www.lightcrest.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.lightcrest.com/blog</link>
	<description>A Blog About Complex Managed Hosting, Enterprise Search, and Lightcrest Miscellany</description>
	<lastBuildDate>Fri, 06 Jan 2012 00:10:12 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Exploring Standard Error</title>
		<link>http://www.lightcrest.com/blog/?p=147</link>
		<comments>http://www.lightcrest.com/blog/?p=147#comments</comments>
		<pubDate>Fri, 06 Jan 2012 00:10:12 +0000</pubDate>
		<dc:creator>Zach Fierstadt</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=147</guid>
		<description><![CDATA[When dealing with large data sets, one strives to make accurate inferences
based on samples, often with the aim of making critical decisions based on
evidence, not just "educated guesses" or optimistic hindsight leading
to "we think we improved the situation." How much did that new
taxonomy actually improve the search experience? How confident
are we that the new relevancy model led to increased click-through
for those top 100 SKUs? How do we approximate our bandwidth
utilization on a 10 gigabit link without collecting every single packet?

 <a href="http://www.lightcrest.com/blog/?p=147">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>When dealing with large data sets, one strives to make accurate inferences<br />
based on samples, often with the aim of making critical decisions based on<br />
evidence, not just &#8220;educated guesses&#8221; or optimistic hindsight leading<br />
to &#8220;we think we improved the situation.&#8221; How much did that new<br />
taxonomy actually improve the search experience? How confident<br />
are we that the new relevancy model led to increased click-through<br />
for those top 100 SKUs? How do we approximate our bandwidth<br />
utilization on a 10 gigabit link without collecting every single packet?</p>
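<p>As a rough illustration of the sampling question above, the sketch below estimates a mean from a random sample and reports its standard error (the sample standard deviation divided by the square root of the sample size). It is written in Python 3, and the packet sizes, sample size, and function name are made-up assumptions for illustration, not measurements from a real link.</p>
<pre>import math
import random
import statistics

def mean_and_standard_error(sample):
    """Return the sample mean and the standard error of that mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, se

# Hypothetical population: 100,000 packet sizes in bytes.
rng = random.Random(42)
population = [rng.choice([64, 576, 1500]) for _ in range(100000)]

# Inspect only 1,000 packets instead of all of them.
sample = rng.sample(population, 1000)
mean, se = mean_and_standard_error(sample)

# Under the CLT, roughly 95% of intervals built this way contain the true mean.
low, high = mean - 1.96 * se, mean + 1.96 * se
print("estimated mean: %.1f bytes (95%% CI %.1f to %.1f)" % (mean, low, high))</pre>
<p>Quadrupling the sample size only halves the standard error, which is why a modest sample is often enough for a decision.</p>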
<p><a href="http://www.lightcrest.com/site_media/pdfs/SE_BLOG.pdf">Exploring Standard Error &#038; CLT</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=147</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scaling Search: Data Indexing and Storage (Part 2 of 3)</title>
		<link>http://www.lightcrest.com/blog/?p=140</link>
		<comments>http://www.lightcrest.com/blog/?p=140#comments</comments>
		<pubDate>Fri, 21 Jan 2011 00:56:09 +0000</pubDate>
		<dc:creator>Michael Hughes</dc:creator>
				<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Network Engineering]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[document processing]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[latency]]></category>
		<category><![CDATA[shards]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=140</guid>
		<description><![CDATA[Available Platforms This is the second post in a 3-post series about scaling enterprise search.  Today&#8217;s post will focus on the actual indexing engines and platforms that are commonly in use today, as well as provide some light overview of &#8230; <a href="http://www.lightcrest.com/blog/?p=140">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>Available Platforms</strong></p>
<p>This is the second post in a 3-post series about scaling enterprise search.  Today&#8217;s post will focus on the actual indexing engines and platforms that are commonly in use today, as well as provide some light overview of hardware requirements.</p>
<p>The search engine landscape is populated with both open source and proprietary platforms.  In the open source world, the most prominent and widely used platform is Solr, which is built on the search library Lucene.  Both of these projects are from the Apache Software Foundation and are commercially supported by Lucid Imagination.  Other platforms such as Sphinx provide special integration with databases such as MySQL.  One of the most common functions of a search engine is to provide high-volume, high-performance searching of information that is stored long-term in a database.</p>
<p><span id="more-140"></span></p>
<p>The proprietary search engine world is dominated by a few key players.  Microsoft (FAST ESP), IBM, Oracle, and Autonomy have all developed and marketed enterprise search platforms for use in high-end enterprise environments.  These platforms usually provide a bevy of options and features, such as data source plug-ins, document processing/linguistics features, and auto-scaling capabilities designed to help your organization get as much performance as possible from your hardware and software investment.</p>
<p><strong>What&#8217;s Taking So Long? </strong></p>
<p>No matter which platform you end up choosing, one of the challenges with creating large search indexes is dealing with the sheer volume of data that is both consumed and created by the process of indexing.  During indexing, a search engine may require the crawling, processing, and storing of millions (or billions) of documents, many of which can be several megabytes in size.</p>
<p>Simplifying things a bit for explanation, when a search engine indexes a document corpus, it builds what&#8217;s called an inverted index (sometimes called a reverse index).  This is the same type of index that is found at the back of a book: a list of words, each followed by a list of the documents (or pages) that contain it.  Every word (or token) in every document must be read, cataloged, analyzed, and collated into this index.  If any linguistics processing (such as synonym expansion or stemming) has been performed on the document during document processing, this job can grow to be even larger than it appears at face value.  In addition to the text that is found in a document, many other properties of a document, such as its size, creation date, and owner, can be indexed as well.  This information is called metadata and can provide much more granularity in a search application.</p>
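<p>A minimal sketch of that back-of-the-book structure, in Python 3, might look like the following. The function names and the three-document corpus are invented for illustration; real engines add stemming, positions, scoring data, and on-disk formats that this toy ignores.</p>
<pre>from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def build_index(documents):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

# Hypothetical three-document corpus.
docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a search library",
    3: "Sphinx integrates with MySQL",
}
index = build_index(docs)
print(index["lucene"])   # [1, 2]
print(index["search"])   # [2]</pre>
<p>Answering a query then becomes a dictionary lookup rather than a scan of every document, which is the whole point of paying the indexing cost up front.</p>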
<p>Needless to say, this kind of work adds up to a considerable amount of computation.  One way of overcoming this large task is to divide and conquer.  Many search engines today consist of a federation of machines, each designed to handle a small portion of the document corpus.  By splitting the data into shards, each computer in the cluster can process and index a more manageable portion of data.  Many enterprise search platforms (such as FAST ESP) perform this work transparently.  Others require a bit of intelligence on the application developer&#8217;s part to determine how to route data for indexing.</p>
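<p>For platforms that leave routing to the application, one common approach is to hash a document key and take it modulo the number of shards. The Python 3 sketch below assumes string document ids and four shards; both the ids and the function names are hypothetical, and this is only one of several possible routing schemes.</p>
<pre>import zlib

def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (stable across runs)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def route(doc_ids, num_shards):
    """Group document ids by the shard that should index them."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

# Hypothetical ids; a real feed would supply its own keys.
doc_ids = ["doc-%d" % i for i in range(1000)]
for shard, ids in route(doc_ids, 4).items():
    print("shard %d: %d documents" % (shard, len(ids)))</pre>
<p>Note that with simple modulo routing, changing the shard count moves most documents to a different shard, so adding shards generally means re-indexing.</p>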
<p>In addition to indexing small shards of data across many machines, an enterprise-grade search engine requires redundant systems to absorb high load, survive equipment failure, and maintain query performance (which will be covered in the next article in this series).  Oftentimes, indexing is performed on one &#8220;row&#8221; of machines (which together represent one copy of the document corpus), and the built indexes are then copied to the redundant machines.  This way, index consistency and query performance are ensured for a very small cost in terms of network traffic and latency.</p>
<p><strong>Hardware Requirements</strong></p>
<p>In the enterprise search world, sizing exercises are notoriously difficult to get right.  Usually they involve an experienced search engineer&#8217;s educated estimates backed up by rigorous testing in development environments.  For example, to estimate the storage space needed for 1M documents, a sizing exercise might consist of performing 10 runs of 100k documents and recording the average size of the built index on disk.  This data can then be used to extrapolate requirements for larger corpuses.  Similar trials can be performed to determine indexing latency, total time to index, and query performance.  Although these estimates will never be perfectly accurate, they hold up surprisingly well to production trials, and as long as you build in a fair amount of runway, no emergencies should be encountered.  Remember, you can always add more shards and re-index again later.  In fact, you will more than likely need to do this as your search engine grows.</p>
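<p>The arithmetic behind such a sizing exercise can be sketched in a few lines of Python 3. The run sizes below are made-up numbers, the 25% headroom is an arbitrary assumption standing in for &#8220;a fair amount of runway&#8221;, and the extrapolation assumes index size grows linearly with document count, which should be checked against your own trials.</p>
<pre>import statistics

def extrapolate_index_size(run_sizes_gb, docs_per_run, target_docs, headroom=0.25):
    """Linearly scale the average index size per run up to target_docs.

    headroom is an assumed safety margin (25% by default), not a rule.
    """
    avg_gb = statistics.mean(run_sizes_gb)
    estimate_gb = avg_gb * target_docs / docs_per_run
    return estimate_gb, estimate_gb * (1 + headroom)

# Made-up measurements: index size in GB for 10 runs of 100k documents.
runs = [4.1, 3.9, 4.0, 4.2, 4.0, 3.8, 4.1, 4.0, 3.9, 4.0]
estimate, with_headroom = extrapolate_index_size(runs, 100000, 1000000)
print("estimated index size: %.1f GB (%.1f GB with headroom)" % (estimate, with_headroom))</pre>
<p>The same function can be reused for indexing time or latency by swapping in those measurements instead of sizes.</p>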
<p>Most enterprise installations consist of clusters of 10 to 100 machines, arranged as 2 to 5 rows of 5 to 20 shards each.  These machines are generally multi-core 64-bit computers with very fast network and disk subsystems, typically at least gigabit Ethernet and SAS RAID or SAN storage.  These of course must be connected to high-end network switches, which can raise costs substantially.</p>
<p><strong>Software as a Service</strong></p>
<p>Because running an enterprise search engine in-house is generally a serious operational commitment, many organizations are instead opting for a SaaS model in which they pay a service provider a monthly fee based on the number of documents they are trying to index, the amount of space the documents take up on disk, the number of queries they wish to issue against this index, or a combination of these three metrics.  By automating the maintenance and scalability of enterprise search, search vendors offer a very attractive alternative to in-house management at a price point that is generally a fraction of the cost.</p>
<p>The next post in this series will focus on the query side of a search engine &#8211; how a search engine processes text requests, queries the actual search index, and delivers results to a user.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=140</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building a custom package (Ruby 1.9.2-p0) using RPM</title>
		<link>http://www.lightcrest.com/blog/?p=122</link>
		<comments>http://www.lightcrest.com/blog/?p=122#comments</comments>
		<pubDate>Thu, 23 Dec 2010 19:07:35 +0000</pubDate>
		<dc:creator>Taylor</dc:creator>
				<category><![CDATA[Customer Service]]></category>
		<category><![CDATA[Managed Hosting]]></category>
		<category><![CDATA[Package Management]]></category>
		<category><![CDATA[Ruby]]></category>
		<category><![CDATA[Software Development]]></category>
		<category><![CDATA[package management]]></category>
		<category><![CDATA[rpm]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[sysadmin]]></category>
		<category><![CDATA[systems administration]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=122</guid>
		<description><![CDATA[From time to time the systems team here at Lightcrest builds custom packages for our clients that allow for easy, repeatable roll-outs of development or production environments.  We keep a central version control repository (svn or git, depending on customer &#8230; <a href="http://www.lightcrest.com/blog/?p=122">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>From time to time the systems team here at Lightcrest builds custom packages for our clients that allow for easy, repeatable roll-outs of development or production environments.  We keep a central version control repository (svn or git, depending on customer preference) of these packages in both their source and binary formats for easy administration and quality consistency.</p>
<p>This blog post will give a brief overview on how to build custom RPM packages that can then be installed onto multiple systems through your preferred package management and deployment utility.  For the purpose of this document, we will be packaging the latest ruby (1.9.2-p0) for CentOS 5.  Since we don&#8217;t want to override the CentOS provided ruby packages due to version conflicts and potentially breaking the provided ruby gems, we will be installing this into /opt.</p>
<p><span id="more-122"></span></p>
<p>Before we dive into spec file creation, we will need to install the packages that allow us to build RPMs.  You will want to use yum as follows to install the &#8216;rpm-build&#8217; package along with its dependencies.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ sudo yum check-update</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ sudo yum install rpm-build</span></span></pre>
<p>Or alternatively, using apt:</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ sudo apt-get update</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ sudo apt-get install rpm-build</span></span></pre>
<p>Once that&#8217;s completed, you will have a new directory structure built out in /usr/src/redhat.  You should now see the following directories, which rpm-build uses when building packages.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ ls /usr/src/redhat/</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">BUILD RPMS SOURCES SPECS SRPMS</span></span></pre>
<p>Note that if you&#8217;re building this package as a non-root user (preferred), you can create the same directory structure in $HOME to complete this project.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">mkdir -p $HOME/build/{BUILD,RPMS,SOURCES,SPECS,SRPMS}</span></span></pre>
<p>Ok, now we&#8217;re ready to begin!</p>
<p>Let&#8217;s start by grabbing the correct source tarball from the distributor.  You can do this by using wget, an ftp client, or simply downloading it through your browser and uploading it to the correct location.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ wget ftp://ftp.ruby-lang.org/pub/ruby/1.9/ruby-1.9.2-p0.tar.gz</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">$ mv ruby-1.9.2-p0.tar.gz ~/build/SOURCES</span></span></pre>
<p>Now it&#8217;s time to build out your spec file.  A spec file is a set of instructions that tells rpm-build how to build the application with the correct settings and then install it into the correct location.  Choose your preferred editor and open up your spec file.</p>
<pre>$ vi ruby192p0.spec</pre>
<p>Since we&#8217;re not installing this package into the default /usr and/or /usr/local location, we need to override some of the default macros provided by Red Hat.  You&#8217;ll want to put the following at the top of your spec file:</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define _prefix		/opt/local</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define _localstatedir	/opt/local/var</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define _mandir		/opt/local/man</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define _infodir	/opt/local/share/info</span></span></pre>
<p>By default, those macros would typically point to locations under /usr and /var, rather than /opt/local.  The URL below is a great source of information on the predefined macros provided by rpm.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">https://www.zarb.org/~jasonc/macros.php</span></span></pre>
<p>Next step is to define the name and version of the package.  Since we want this package to live side-by-side with the CentOS provided ruby package, we&#8217;ll need to make some slight adjustments to the package name.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define rubyver		1.9.2</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%define rubyminorver	p0</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Name:		ruby%{rubyver}%{rubyminorver}</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Version:	%{rubyver}%{rubyminorver}</span></span></pre>
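<p>To see what those lines expand to, here is a toy macro expander in Python 3. It is only a sketch of simple %{name} substitution written for this post; rpm&#8217;s real macro engine also handles conditionals such as %{?dist}, shell expansions, and much more, and the expand function is not part of any rpm tooling.</p>
<pre>import re

def expand(text, macros):
    """Replace each %{name} with its value until nothing is left to expand."""
    pattern = re.compile(r"%\{(\w+)\}")
    previous = None
    while text != previous:
        previous = text
        # Unknown macros are left untouched rather than guessed at.
        text = pattern.sub(lambda m: macros.get(m.group(1), m.group(0)), text)
    return text

macros = {"rubyver": "1.9.2", "rubyminorver": "p0"}
macros["name"] = expand("ruby%{rubyver}%{rubyminorver}", macros)
macros["version"] = expand("%{rubyver}%{rubyminorver}", macros)

print(macros["name"])      # ruby1.9.2p0
print(macros["version"])   # 1.9.2p0
print(expand("ruby-%{rubyver}-%{rubyminorver}.tar.gz", macros))  # ruby-1.9.2-p0.tar.gz</pre>
<p>Because the resulting package name is ruby1.9.2p0 rather than ruby, it does not collide with the stock CentOS package.</p>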
<p>Now let&#8217;s populate the rest of the headers.  The Release header is incremented by 1 each time you complete a successful build that adds or removes changes.  The rest of the headers describe the package and give credit where it is due.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Release:	1%{?dist}</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">License:	Ruby License/GPL - see COPYING</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">URL:		http://www.ruby-lang.org/</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Source0:	ftp://ftp.ruby-lang.org/pub/ruby/ruby-%{rubyver}-%{rubyminorver}.tar.gz</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Summary:	An interpreter of object-oriented scripting language</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Group:		Development/Languages</span></span></pre>
<p>Now we must define our build root and build requires.  A build root is a temporary directory where the software is installed before it is packaged.  This prevents the build from accidentally overwriting any system files.  Once the rpm is packaged, the build root is removed.</p>
<p>Next is the build requires.  This is a set of packages that must be installed prior to building your rpm.  Without these packages your build will certainly fail, or not build as intended.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">BuildRoot:	%{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">BuildRequires:	readline readline-devel ncurses ncurses-devel gdbm gdbm-devel glibc-devel tcl-devel tk-devel libX11-devel gcc unzip openssl-devel db4-devel byacc</span></span></pre>
<p>Now we define a description of the package.  We did define a summary of the package in the headers, but this gives a more in-depth look at exactly what the package provides.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%description</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">Ruby is the interpreted scripting language for quick and easy</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">object-oriented programming.  It has many features to process text</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">files and to do system management tasks (as in Perl).  It is simple,</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">straight-forward, and extensible.</span></span></pre>
<p>Enough with the descriptions of the package, let&#8217;s move on to the build process.  First we must define the %prep section and its %setup macro.</p>
<p>The %prep section prepares the sources for compilation inside the &#8216;BUILD&#8217; directory (it does not touch the &#8216;BuildRoot&#8217; which we defined earlier).</p>
<p>The %setup macro unpacks the source tarball.  In our example below we pass the -n flag to tell %setup which directory to change into after it unpacks the tarball.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%prep</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%setup -n ruby-%{rubyver}-%{rubyminorver}</span></span></pre>
<p>Next we have the %build script, which is the actual set of instructions for compilation.  First we define our CFLAGS (which were taken from the base ruby package provided by CentOS) and export them.  We then run the %configure macro (a wrapper around ./configure that passes along the paths we defined earlier) and add any arguments we see fit for our build.  Last, we run &#8216;make&#8217; to compile the application, passing %{?_smp_mflags} so that make can run parallel jobs when the build machine has multiple CPUs.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%build</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">CFLAGS="$RPM_OPT_FLAGS -Wall -fno-strict-aliasing"</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">export CFLAGS</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%configure \</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">--enable-shared \</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">--disable-rpath</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">make RUBY_INSTALL_NAME=ruby %{?_smp_mflags}</span></span></pre>
<p>Next we have the %install script, which is the set of instructions for installing the package into the build root.  First we remove the build root completely in order to remove any previous fragmented builds that failed.  Then, we run &#8216;make install&#8217; and define the DESTDIR argument so it knows to install into the build root, and not over any system files.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%install</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">rm -rf $RPM_BUILD_ROOT</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;"># installing binaries ...</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">make install DESTDIR=$RPM_BUILD_ROOT</span></span></pre>
<p>Then, we want to again clean out the build root once our build is completed.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%clean</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">rm -rf $RPM_BUILD_ROOT</span></span></pre>
<p>And now we want to define the files that need packaging.  RPM will look for these files in the build root.  In this case, we list the documentation files from the source tarball, plus everything that was installed under /opt/local in the build root.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%files</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%defattr(-, root, root)</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%doc README COPYING ChangeLog LEGAL ToDo</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%{_prefix}/*</span></span></pre>
<p>And last but not least we want to update our change log.  This provides an easy way to keep track of changes and upgrades going forward.</p>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">%changelog</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">* Wed Dec 22 2010 Taylor Kimball &lt;taylor@lightcrest.com&gt; - 1.9.2-p0-1</span></span></pre>
<pre><span style="font-family: Georgia, 'Bitstream Charter', serif; color: #444444;"><span style="line-height: 22px;">- Initial build for el5 based off of el5 spec.</span></span></pre>
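<p>With the spec file complete, the remaining step is to run the build. The Python 3 sketch below only assembles and prints the rpmbuild command line; it assumes the spec was saved as ~/build/SPECS/ruby192p0.spec inside the non-root build tree created earlier, and the rpmbuild_command helper is a name invented for this example.</p>
<pre>import os
import shlex
import subprocess

def rpmbuild_command(spec_path, topdir):
    """Build the argument list for a full (binary + source) package build."""
    return [
        "rpmbuild",
        "-ba",                              # build binary and source packages
        "--define", "_topdir %s" % topdir,  # use our tree, not /usr/src/redhat
        spec_path,
    ]

topdir = os.path.expanduser("~/build")
spec = os.path.join(topdir, "SPECS", "ruby192p0.spec")
command = rpmbuild_command(spec, topdir)
print(" ".join(shlex.quote(arg) for arg in command))

# Uncomment on a machine that has rpm-build installed:
# subprocess.check_call(command)</pre>
<p>If the build succeeds, the binary package should appear in an architecture subdirectory of RPMS and the source package in SRPMS, both under the chosen top directory.</p>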
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=122</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using Twisted for Rapid Application Development</title>
		<link>http://www.lightcrest.com/blog/?p=35</link>
		<comments>http://www.lightcrest.com/blog/?p=35#comments</comments>
		<pubDate>Wed, 15 Sep 2010 01:23:38 +0000</pubDate>
		<dc:creator>Zach Fierstadt</dc:creator>
				<category><![CDATA[Customer Service]]></category>
		<category><![CDATA[Managed Hosting]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Sales]]></category>
		<category><![CDATA[Software Development]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=35</guid>
		<description><![CDATA[Introduction to BridgeBot Hey guys, I thought I&#8217;d drop my first post with something potentially useful for folks out there who love to write python and happen to need protocol bridging for their chat systems. As you may or may &#8230; <a href="http://www.lightcrest.com/blog/?p=35">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<h1>Introduction to BridgeBot</h1>
<p>Hey guys, I thought I&#8217;d drop my first post with something potentially useful for folks out there who love to write python and happen to need protocol bridging for their chat systems. As you may or may not know, Lightcrest has a chat system in place that allows users to interact with our sales and engineering staff. Rather than purchase a third party application, we decided to build it ourselves so we could extend it in the future (also &#8211; why buy something when you can build it in four hours?).</p>
<p>Our chat system runs off a custom Flex app that talks to a custom ratbox IRC daemon. When the Flex app loads, it talks to our custom IRC daemon over a TCP socket and initiates, registers, and funnels messages as any other IRC client would.</p>
<p><span id="more-35"></span></p>
<p>Now obviously we don&#8217;t want our staff to sit around and watch their IRC terminals all day. We wanted the ability to have the chat system alert our staff efficiently from any device with an AIM client &#8211; whether it be an iPhone, Droid, or home workstation. Would you want to have your sales staff learn IRC? We wouldn&#8217;t.</p>
<h1>Twisted Framework</h1>
<p>So off we went to the Python Twisted Framework. If you haven&#8217;t used Twisted before, it takes a little getting used to &#8211; but once you&#8217;ve got it working you&#8217;ll never want to write a select() or poll() loop ever again. Twisted essentially abstracts all the code you have to rewrite for every single network app you build, and allows you to chain your app functionality in the form of &#8216;chained deferreds&#8217;, which are essentially just callbacks that are assigned to an asynchronous event.</p>
<p>Being a pragmatist, I&#8217;m not going to go over how Twisted actually works or what it actually is &#8211; they do it far better on the official site at <a href="http://twistedmatrix.com/documents/current/core/howto/index.html">twistedmatrix.com</a>.</p>
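<p>Still, for anyone who hasn&#8217;t seen the pattern, here is a toy stand-in for a chain of callbacks, written in plain Python 3 so it runs without Twisted installed. The TinyDeferred class and the callback names are invented for this sketch; Twisted&#8217;s real Deferred exposes addCallback and callback methods that behave along these lines, and also handles errors through errbacks, which this toy leaves out.</p>
<pre>class TinyDeferred:
    """Toy model of a result that is not available yet."""

    def __init__(self):
        self._callbacks = []
        self._fired = False
        self._result = None

    def add_callback(self, func):
        """Queue func; it receives the return value of the previous callback."""
        self._callbacks.append(func)
        if self._fired:
            self._run()
        return self  # returning self allows chaining

    def callback(self, result):
        """Fire the chain once the asynchronous event has produced a result."""
        self._fired = True
        self._result = result
        self._run()

    def _run(self):
        while self._callbacks:
            self._result = self._callbacks.pop(0)(self._result)

def strip_prefix(line):
    return line.split(":", 1)[1].strip()

def shout(text):
    return text.upper()

seen = []
d = TinyDeferred()
d.add_callback(strip_prefix).add_callback(shout).add_callback(seen.append)

# Nothing runs until the "network event" delivers data.
d.callback("visitor42: hello there")
print(seen)  # ['HELLO THERE']</pre>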
<p>In this case, we wanted to bridge two protocols. How do you get two Twisted factories to talk to each other? While this may seem obvious, once you play with the framework a bit you realize it&#8217;s up to you to figure out how to bridge communications between multiple clients within the same Twisted reactor. I hope this snippet of code makes your life easier down the road (and perhaps you&#8217;ll come up with even better ways of doing it).</p>
<p>We decided to simply stack connection objects into a queue, and share the queues between the factories/clients.</p>
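<p>Stripped of Twisted entirely, the shared-queue idea looks something like the Python 3 sketch below. The BridgeSide class and its method names are hypothetical stand-ins for the real protocol classes that follow; the only point is that both sides hold references to the same two lists.</p>
<pre>class BridgeSide:
    """Stand-in for one protocol connection (IRC or AIM)."""

    def __init__(self, name, own_queue, peer_queue):
        self.name = name
        self.own_queue = own_queue    # where this side registers itself
        self.peer_queue = peer_queue  # where it looks up the other side
        self.sent = []

    def connected(self):
        # Register this connection so the other side can find it.
        self.own_queue.append(self)

    def message_received(self, text):
        # Relay to the other protocol, if it has connected yet.
        if self.peer_queue:
            self.peer_queue[0].send(text)

    def send(self, text):
        self.sent.append(text)

# Both lists are created once and handed to both sides.
irc_queue, aim_queue = [], []
irc = BridgeSide("irc", irc_queue, aim_queue)
aim = BridgeSide("aim", aim_queue, irc_queue)
irc.connected()
aim.connected()

irc.message_received("visitor42: hello")   # IRC to AIM
aim.message_received("staff: hi there")    # AIM to IRC
print(aim.sent)  # ['visitor42: hello']
print(irc.sent)  # ['staff: hi there']</pre>
<p>The emptiness check in message_received is a choice made for this sketch: it quietly drops messages that arrive before the other side has signed on, where indexing the queue directly would raise an IndexError.</p>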
<h1>For the Propeller Heads</h1>
<p>So here&#8217;s the code. Thanks to ActivePython and thus Richard Stevens for the unix process handling code. If you don&#8217;t know who Richard Stevens is, he&#8217;s the guy who wrote TCP/IP Illustrated Volumes 1 &#8211; 3 (a must have for any software engineer).</p>
<p>Please note this isn&#8217;t a tutorial. I&#8217;m going to run through the code from the top down, so if the flow of the blog is funny, my apologies in advance.</p>
<pre># BridgeBot - AIM &lt;-&gt; IRC bridge written
# for Lightcrest chat interface. 

import os
import re
import sys
from twisted.words.protocols import irc,oscar
from twisted.internet import protocol,reactor</pre>
<p>Nothing too interesting here other than the importing of the namespaces that contain the Oscar and IRC classes we need to leverage the respective protocols.</p>
<p>First, let&#8217;s define the configuration parser. This class will parse a very basic config file that lets you list the AIM screennames that should be alerted on any incoming message from the website.</p>
<pre>class SetConfig:
   """ Set aim handles for future broadcasting"""
   def __init__(self,config_file):
      self.aim_handles = []
      for line in open(config_file).read().split('\n'):
         if len(line) and line[0] != "#":
            self.aim_handles.append(line.strip())</pre>
<p>This will let us parse AIM SN&#8217;s out of a flat file like this:</p>
<pre>[zfierstadt@foobar bridgebot]$ cat aimhandles.cf
# This configuration file lists AIM sn's
# to relay BridgeBot messages from customers
# to LC staff.
zachflap909
shanpors62
mhu99
nclacemel
#skycrane
+18888575309</pre>
<p>Now let&#8217;s define our BridgeBot class. This will be the code responsible<br />
for the IRC side of the equation.</p>
<pre>class BridgeBot(irc.IRCClient):
    def _get_nickname(self):
        return self.factory.nickname

    def _get_myqueue(self):
       return self.factory.myqueue

    def _get_aimqueue(self):
       return self.factory.aimqueue

    nickname = property(_get_nickname)
    myqueue = property(_get_myqueue)
    aimqueue = property(_get_aimqueue)

    def signedOn(self):
        self.join(self.factory.channel)
        print "Signed on as %s." % (self.nickname,)
        # add this connection to my global queue
        self.myqueue.append(self)

    def joined(self, channel):
        print "Joined %s." % (channel,)

    def privmsg_quick(self,recip,message):
       self.sendLine("PRIVMSG %s :%s" % (recip,message))

    def irc_PRIVMSG(self, prefix, params):
        """
        Called when we get a message.
        """
        user = prefix.split("!")[0]
        channel = params[0]
        message = user + ": " + params[-1]

        self.aimqueue[0].broadcastMessage(message)

        if debug:
           print "Received PRIVMSG: %s %s %s" %(user,channel,message)</pre>
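<p>To make the message handling concrete, here is a sketch of just the string work done in irc_PRIVMSG. The function name format_relay and the sample prefix are made up for illustration.</p>
<pre>def format_relay(prefix, params):
    """Build the text that would be broadcast to AIM for one PRIVMSG."""
    # An IRC prefix looks like nick!user@host; keep only the nick
    user = prefix.split("!")[0]
    # The last parameter carries the message body
    return user + ": " + params[-1]


line = format_relay("customer42!web@example.com", ["#blah", "is anyone there?"])</pre>
<p>Prefixing the nick is what later lets staff reply to a specific customer from AIM.</p>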
<p>Here is the actual Factory for BridgeBot.</p>
<pre>class BridgeBotFactory(protocol.ClientFactory):
    protocol = BridgeBot

    def __init__(self, channel, nickname='bridgebot', myqueue=None,
                 aimqueue=None):
        self.channel = channel
        self.nickname = nickname
        # Queues are normally passed in and shared with the AIM side;
        # fall back to fresh lists rather than mutable default arguments
        self.myqueue = myqueue if myqueue is not None else []
        self.aimqueue = aimqueue if aimqueue is not None else []

    def clientConnectionLost(self, connector, reason):
        print "Lost connection (%s), reconnecting." % (reason,)
        connector.connect()

    def clientConnectionFailed(self, connector, reason):
        print "Could not connect: %s" % (reason,)</pre>
<p>Now for the Oscar implementation. This will be the AIM side of  the equation, allowing us to relay messages from the IRC connection to the AIM connection.</p>
<pre>class BosConn(oscar.BOSConnection):

    capabilities = [oscar.CAP_CHAT]

    def initDone(self):
        self.requestSelfInfo().addCallback(self.gotSelfInfo)
        self.requestSSI().addCallback(self.gotBuddyList)
        # Add this connection to my global queue
        self.myqueue.append(self)

    def gotSelfInfo(self, user):

        if debug: print user.__dict__
        self.name = user.name

    def gotBuddyList(self, l):

        if debug: print l
        self.activateSSI()
        self.setIdleTime(0)
        self.clientReady()

    def receiveMessage(self, user, multiparts, flags):

        if debug: print user.name, multiparts, flags
        if debug: print "multiparts!! ", multiparts

        # auto messages should not be responded to. identify them by
        # the string auto, found in flags[0] (sometimes).

        try:
            auto = flags[0]
            if auto == "auto":
                return
        except IndexError:
            pass

        self.lastUser = user.name
        message=self.modifyReturnMessage(multiparts)
        try:
           user = message.split(":")[0]
           message = message[len(user)+1:]
           self.ircqueue[0].privmsg_quick(user,message)
        except:
           pass

    def broadcastMessage(self,message):

       for username in self.bcast:
          if debug: print "Broadcasting to %s" %(username)
          self.sendMessage(username,message, wantAck = 1, \
                        autoResponse = (self.awayMessage!=None))\
                        .addCallback(self.respondToMessage)

    def respondToMessage(self, (username, message)):

        if debug: print "in respondToMessage"
        pass

    def receiveChatInvite(self, user, message, exchange, fullName \
                           ,instance, shortName, inviteTime):
        pass

    def extractText(self, multiparts):

        message = multiparts[0][0]
        match = re.compile("&gt;([^&gt;&lt;]+?)&lt;").search(message)
        if match:
            return match.group(1)
        else:
            return message

    def modifyReturnMessage(self, multiparts):
        if debug: print "in modifyReturnMessage"
        message_text = self.extractText(multiparts)
        multiparts[0] = (message_text,)

        return multiparts[0][0]</pre>
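<p>The reply routing in receiveMessage boils down to splitting a "nick: message" string. The sketch below uses a hypothetical helper, split_reply, to show that step on its own. Unlike the code above, which would treat the whole text as the nickname when no colon is present, this version returns None so the caller can decide what to do.</p>
<pre>def split_reply(text):
    """Split 'nick: message' into (nick, message), or None if no nick."""
    if ":" not in text:
        return None
    nick, body = text.split(":", 1)
    return nick.strip(), body.strip()


routed = split_reply("customer42: we can help with that")
unrouted = split_reply("no recipient here")</pre>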
<p>Now we subclass OscarAuthenticator and add our own interesting bits<br />
to BOSClass so they can be accessible when the reactor starts. Here we<br />
define the separate queues for later manipulation.</p>
<pre>class OA(oscar.OscarAuthenticator):
   BOSClass = BosConn

   def __init__(self,username,password,deferred=None,icq=0 \
                ,myqueue=[],ircqueue=[],bcast=[]):

      self.username=username
      self.password=password
      self.deferred=deferred
      self.icq=icq

      # Make our global queues accessible to BOS
      self.BOSClass.myqueue = myqueue
      self.BOSClass.ircqueue = ircqueue
      self.BOSClass.bcast = bcast</pre>
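<p>The trick here is plain Python: attributes set on a class are visible from every instance of it. A small sketch with hypothetical stand-in classes (Conn for BosConn, Authenticator for OA) shows the effect.</p>
<pre>class Conn(object):
    """Stand-in for BosConn (hypothetical)."""


class Authenticator(object):
    """Stand-in for OA: publishes shared state on the connection class."""
    conn_class = Conn

    def __init__(self, myqueue, bcast):
        # Class attributes are visible from every instance of the class
        self.conn_class.myqueue = myqueue
        self.conn_class.bcast = bcast


aimqueue = []
Authenticator(aimqueue, ["zachflap909"])

first = Conn()
second = Conn()
first.myqueue.append(first)</pre>
<p>Because the state lives on the class, a second authenticator would overwrite it for every connection. That is fine for a single bot, but worth keeping in mind if you ever run two AIM accounts in one process.</p>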
<p>Great. So the framework is built &#8211; now we need to wrap everything up<br />
in a daemon that can be launched and eventually thrown into an init script.</p>
<p>Here we define the process fork, file descriptor clean up, and option parsing.</p>
<pre>if __name__ == "__main__":

    # Parse options
    from optparse import OptionParser

    usage = "usage: %prog [options] arg"
    parser = OptionParser(usage)
    parser.add_option("-f", "--foreground" \
         ,dest="foreground",action="store_true" \
         ,help="run bridgebot in foreground")
    parser.add_option("-d", "--daemon" \
                      ,dest="daemon",action="store_true" \
                      ,help="fork into daemonized process")
    parser.add_option("-v", "--verbose",
                      dest="verbose",action="store_true" \
                      ,help="print debug output to standard output")

    # Initialize option states
    debug = False
    background = False
    foreground = False

    (options, args) = parser.parse_args()

    if options.foreground and options.daemon:
       parser.error("options -f and -d are mutually exclusive")
    elif options.foreground:
        foreground = True
    elif options.daemon:
        background = True
    else:
        parser.print_help()
        sys.exit(1)

    # -v is independent of the run mode, so check it separately
    if options.verbose:
        debug = True

    # Default daemon parameters.
    # File mode creation mask of the daemon.
    UMASK = 0

    # Default working directory for the daemon.
    WORKDIR = os.environ.get("BRIDGEBOT_DIR")

    # Fork child process into the background
    # if we're in daemon mode.

    # Otherwise, set pid to 0 and launch in the
    # foreground. 

    if(background):
       pid = os.fork()
    else:
       pid = 0

    if (not pid):

       if(background):

          # Change to working directory
          os.chdir(WORKDIR)

          # We probably don't want the file mode creation mask
          # inherited from the parent, so we give the child
          # complete control over permissions.
          os.umask(UMASK)

          # Child writes PID to disk
          pid_fd = open(WORKDIR+"/run/bridgebot.pid","w")
          pid_fd.write(str(os.getpid()))
          pid_fd.close()

       # The rest runs in both foreground and daemon mode.

       # irc settings
       chan = "#blah"

       # aim settings
       screenname = "bridgebot"
       password = "RjadsDF!@#m!"
       hostport = ('login.oscar.aol.com',5190)
       icqMode =  0

       # initialize object queues so we can share protocol methods
       # across disparate factory and protocol instances
       ircqueue = []
       aimqueue = []

       # set list of AIM handles for broadcasting
       config = SetConfig('aimhandles.cf')

       # There she blows.
       protocol.ClientCreator(reactor, OA, screenname, password,\
       icq=icqMode,myqueue=aimqueue,ircqueue=ircqueue,\
       bcast=config.aim_handles).connectTCP(*hostport)

       reactor.connectTCP('localhost',9666,BridgeBotFactory(chan,\
       "lightcres",\
       myqueue=ircqueue,aimqueue=aimqueue))

       reactor.run()

    else:
       # Parent exits
       sys.exit(0)</pre>
<p>And there you have it. Now we can launch our shiny bridgebot.</p>
<pre>[zfierstadt@foobar bridgebot]$ python bridgebot.py
usage: bridgebot.py [options] arg

options:
  -h, --help        show this help message and exit
  -f, --foreground  run bridgebot in foreground
  -d, --daemon      fork into daemonized process
  -v, --verbose     print debug output to standard output</pre>
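<p>If you want to poke at the option handling without starting the bot, something like the following works as a rough sketch. The parse_mode function is hypothetical. It requires exactly one of -f or -d and treats -v as an independent flag, and it reports a usage error where the script above prints the help text.</p>
<pre>from optparse import OptionParser


def parse_mode(argv):
    """Return (mode, debug) where mode is 'foreground' or 'daemon'."""
    parser = OptionParser("usage: %prog [options]")
    parser.add_option("-f", "--foreground", dest="foreground",
                      action="store_true", default=False)
    parser.add_option("-d", "--daemon", dest="daemon",
                      action="store_true", default=False)
    parser.add_option("-v", "--verbose", dest="verbose",
                      action="store_true", default=False)
    options, args = parser.parse_args(argv)

    if options.foreground == options.daemon:
        # Either both or neither was given; exactly one is required
        parser.error("choose exactly one of -f or -d")
    mode = "daemon" if options.daemon else "foreground"
    return mode, options.verbose


result = parse_mode(["-d", "-v"])</pre>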
<p>But before we get too excited, let&#8217;s wrap this up in a nice init script &#8211; we don&#8217;t<br />
want to create more work for our systems administrator.</p>
<pre>[zfierstadt@foobar bridgebot]$ cat init/bridgebot
#!/bin/bash
#
# bridgebot	Script to control process bridgebot
#
# Author:       Zach Fierstadt
#
# chkconfig: - 90 10
# description:  Starts and stops process bridgebot

# Source function library.
. /etc/init.d/functions

# The bridgebot working directory
# (exported so that bridgebot.py can read it from its environment)
export BRIDGEBOT_DIR=/home/zfierstadt/bridgebot
# The location of the bridgebot pid file
PIDFILE=$BRIDGEBOT_DIR/run/bridgebot.pid

start() {
	action $"Starting process bridgebot: " /usr/bin/python \
        $BRIDGEBOT_DIR/bridgebot.py -d
}

stop() {

        PID=`cat $PIDFILE`
	action $"Shutting down process bridgebot: " kill -9 $PID
}

# See how we were called.
case "$1" in
  start)
	start
	;;
  stop)
	stop
	;;
  restart|reload)
	stop
	start
	;;
  *)
	echo $"Usage: $0 {start|stop|restart|reload}"
	exit 1
esac

exit 0</pre>
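<p>The stop function trusts that the pid file exists and holds a number. If you would rather do that check from Python, a sketch could look like this. The read_pid helper is hypothetical, and the demo writes a throwaway pid file to a temporary directory instead of touching the real one.</p>
<pre>import os
import tempfile


def read_pid(path):
    """Return the PID stored in path, or None if it is missing or invalid."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


# Demonstrate against a throwaway pid file
workdir = tempfile.mkdtemp()
pidfile = os.path.join(workdir, "bridgebot.pid")
with open(pidfile, "w") as handle:
    handle.write(str(os.getpid()))

found = read_pid(pidfile)
missing = read_pid(os.path.join(workdir, "nope.pid"))</pre>
<p>A caller that gets a real PID back could then pass it to os.kill with signal.SIGTERM, which gives the reactor a chance to shut down cleanly before anyone reaches for kill -9.</p>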
<p>Now we can start and stop bridgebot as a service.</p>
<pre>[zfierstadt@foobar init]$ ./bridgebot start
Starting process bridgebot:                                [  OK  ]
[zfierstadt@foobar init]$</pre>
<p>Now let&#8217;s see this in action. I&#8217;m going to load http://www.lightcrest.com.</p>
<p><a href="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.01.17-PM1.png"><img class="aligncenter size-large wp-image-80" title="Screen shot 2010-09-14 at 5.01.17 PM" src="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.01.17-PM1-1024x794.png" alt="" width="640" height="496" /></a></p>
<p>Let&#8217;s send a request as a customer might over the Flex widget:</p>
<p><a href="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.04.31-PM.png"><img class="aligncenter size-full wp-image-82" title="Screen shot 2010-09-14 at 5.04.31 PM" src="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.04.31-PM.png" alt="" width="323" height="340" /></a></p>
<p>Voila! The message is broadcast to every user listed in aimhandles.cf,<br />
including me.</p>
<p><a href="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-6.36.21-PM.png"><img class="aligncenter size-full wp-image-110" title="Screen shot 2010-09-14 at 6.36.21 PM" src="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-6.36.21-PM.png" alt="" width="843" height="47" /></a></p>
<p>We can address multiple users as long as we prefix each reply with the recipient&#8217;s original<br />
nickname and a colon. That lets us maintain multiple conversations in a single window,<br />
like so:</p>
<p><a href="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-6.36.47-PM.png"><img class="aligncenter size-full wp-image-111" title="Screen shot 2010-09-14 at 6.36.47 PM" src="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-6.36.47-PM.png" alt="" width="846" height="56" /></a></p>
<p>And on the client side our customer sees the message funneled in from AIM:</p>
<p><a href="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.05.06-PM1.png"><img class="aligncenter size-full wp-image-112" title="Screen shot 2010-09-14 at 5.05.06 PM" src="http://www.lightcrest.com/blog/wp-content/uploads/2010/09/Screen-shot-2010-09-14-at-5.05.06-PM1.png" alt="" width="345" height="345" /></a></p>
<p>And there you have it. Customer service tools that allow your team to effectively sell<br />
from any device that AIM can run on.</p>
<p>Until next time!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=35</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scaling Search:  A Series (Post 1)</title>
		<link>http://www.lightcrest.com/blog/?p=23</link>
		<comments>http://www.lightcrest.com/blog/?p=23#comments</comments>
		<pubDate>Fri, 10 Sep 2010 20:18:21 +0000</pubDate>
		<dc:creator>Michael Hughes</dc:creator>
				<category><![CDATA[Enterprise Search]]></category>
		<category><![CDATA[Managed Hosting]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=23</guid>
		<description><![CDATA[This is the first &#8220;real&#8221; post in the Lightcrest company blog.  My name is Michael Hughes, one of the principals here at Lightcrest.  Our blog will hopefully shed a bit of light into the day to day operations of our &#8230; <a href="http://www.lightcrest.com/blog/?p=23">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This is the first &#8220;real&#8221; post in the Lightcrest company blog.  My name is Michael Hughes, one of the principals here at Lightcrest.  Our blog will hopefully shed a bit of light on the day-to-day operations of our company as well as dive into some technical aspects of what we do for our clients.  Since information technology is such a large and diverse field to work in, a lot of what we do can seem somewhat opaque and mysterious to people on the street (as it were), so this is our attempt at clarifying for our friends, colleagues, customers, and the rest of the world what we do on a day-to-day basis.  Zach Fierstadt, another Lightcrest principal, will also be contributing to this blog, as will some of the great folks on the technical and sales teams here at Lightcrest.</p>
<p><span id="more-23"></span></p>
<p>This first series of posts from me will be an introductory overview of how we help high-volume search engines scale larger and faster.  Many data-intensive organizations generate huge volumes of new information on a daily basis.  That data must be continually processed, indexed, and searched from all levels of the organization (and sometimes from the outside as well).  In addition to adding fresh documents and data, old or deprecated data must be reliably removed or blacklisted from the index.  This ongoing process relies on business-defined criteria and is different for every organization.  For example, a real-time Twitter indexing service may require continuous updating and the immediate removal of any articles older than one hour, whereas an online retailer may need to update their product catalog index with product removals, additions, and price updates on a dynamic schedule.  Some archive indexes never remove data; instead, they continuously grow and must be scaled in a sustainable manner that aligns with business requirements.</p>
<p>Once an architecture has been implemented and an operational rhythm has been established, the challenge of scaling those operations in sympathy with user demand begins.  In order to better describe the solutions to scaling a search engine, it helps to have a basic understanding of the general workings of most enterprise search systems.  Once we understand what the individual components are ultimately responsible for, and how they talk to each other, we have a clearly defined path to improving performance in each of those areas.  These fundamentals are shared by search engines of every size, from the smallest to the largest.</p>
<p>Conceptually, search engines consist of three parts, all working in unison to provide the kind of search capability we all know today.  I&#8217;ll describe and provide one strategy to scale each of the components listed below.  Our reference platform will be Solr/Lucene, a widely used open source search package written in Java.  However, these fundamental concepts will also map onto proprietary search platforms such as Microsoft FAST ESP.</p>
<ol>
<li>Data retrieval and processing</li>
<li>Data indexing and storage</li>
<li>Query processing and result retrieval</li>
</ol>
<p>These three areas will be covered in a series of three blog posts, starting with the one you are currently reading.</p>
<p>It should be emphasized that a search engine does not have to be scaled in the order in which these modules are presented.  For example, you may want to increase your query capacity long before (or after!) augmenting capacity for adding more data to your indexing or processing system.  Likewise, you may have a very low query capacity requirement while faced with a large data collection task (for example, indexing every finance news website on the internet).</p>
<p><strong>Data Retrieval and Processing</strong></p>
<p>Described simply, data processing is the process and mechanism behind discovering and determining which documents ultimately belong in your index.  This may require retrieving and sifting through large numbers of candidate documents (for example, from the public web or a large ERP system) and discarding many of them based on business rules or other relevancy criteria.  It could also mean a straight pull out of a database, indexing every single record that exists in order to improve search and retrieval latency.  An enterprise may draw its data from many different sources.  Here are a few examples:</p>
<ul>
<li>Public web sites (e.g., CNN.com, NYTimes.com)</li>
<li>Internal databases (e.g., a human resources database with employee information)</li>
<li>Public feeds (e.g., RSS feeds from industry-specific blogs)</li>
<li>Private feeds (e.g., XML from a sales contact/lead generation tool)</li>
</ul>
<p>Generally the interfaces to these types of data sources are well-known and standards-based.  Heavy data interchange is increasingly common and governed by industry standards which allow for quick and transparent retrieval of data from these sources.  Most enterprise search platforms come with scalable web crawlers that can be expanded at will, limited only by hardware and network capacity.  Furthermore, any source which operates over HTTP (like most RSS and XML feeds) can be leveraged quickly with a bit of custom feed processing code.</p>
<p>Scaling the operations of the data collection process generally involves adding more worker threads to perform more tasks in parallel.  With some careful tuning, high throughput can be achieved across multiple sources without overloading any one source.</p>
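<p>As a rough illustration of that tuning, here is a sketch of a worker pool with a cap on concurrent requests per source. It assumes the concurrent.futures module from Python 3. The fetch function is a placeholder that does no network I/O, and the source names, URLs and limits are invented for the example.</p>
<pre>import threading
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Placeholder for a real HTTP fetch (assumption: returns the body)."""
    return "body of " + url


def make_polite_fetch(per_source_limit):
    """Wrap fetch so each source sees a bounded number of requests at once."""
    limits = {}
    lock = threading.Lock()

    def polite_fetch(job):
        source, url = job
        # Create one semaphore per source the first time it is seen
        with lock:
            if source not in limits:
                limits[source] = threading.Semaphore(per_source_limit)
            gate = limits[source]
        # Block here if this source already has enough requests in flight
        with gate:
            return source, fetch(url)

    return polite_fetch


jobs = [
    ("news", "http://news.example.com/a"),
    ("news", "http://news.example.com/b"),
    ("blog", "http://blog.example.com/feed"),
]

# Eight workers overall, but no more than two in flight per source
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(make_polite_fetch(2), jobs))</pre>
<p>Raising max_workers adds overall parallelism, while the per-source limit is the knob that keeps any single site or database from being hammered.</p>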
<p>The data processing mechanism must be scaled to accommodate the additional information coming from the crawlers/retrievers.  With today&#8217;s multicore processors and fast I/O and network infrastructure, this is generally not difficult and usually involves allocating more processing power (computers) to do any linguistics processing and ETL work.  Many commercial platforms such as FAST ESP ship with document processing frameworks that support the dynamic addition and removal of processing machines.  Coordination through middleware is transparent and allows for rapid expansion of capacity.</p>
<p>Furthermore, most search platforms ship with &#8220;connectors&#8221;, or software that communicates with other platforms.  Connector support exists for most standard communication protocols (such as LDAP, ODBC, public HTTP) as well as more esoteric protocols that may exist within certain commercial software domains like custom ERP solutions.  In addition, most search platforms support API access to their processing and indexing systems so that your organization may author bespoke connectors for your particular application.</p>
<p>So, once you have defined your data sources and proven your ability to reliably access information contained within them, you&#8217;re ready to move onto the next stage, data indexing.  That will be covered in the next post!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=23</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Welcome to the Wavelengths blog.</title>
		<link>http://www.lightcrest.com/blog/?p=1</link>
		<comments>http://www.lightcrest.com/blog/?p=1#comments</comments>
		<pubDate>Wed, 08 Sep 2010 20:56:56 +0000</pubDate>
		<dc:creator>Lightcrest</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.lightcrest.com/blog/?p=1</guid>
		<description><![CDATA[Thank you for visiting Wavelengths, the Lightcrest company blog.  In the coming days we hope to bring you additional information about our company, our employees, and the work we are doing for our clients in the public and private sectors. &#8230; <a href="http://www.lightcrest.com/blog/?p=1">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Thank you for visiting Wavelengths, the Lightcrest company blog.  In the coming days we hope to bring you additional information about our company, our employees, and the work we are doing for our clients in the public and private sectors.</p>
<p>Thanks for visiting &#8211; be sure to take a look around and check out the great services we offer.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.lightcrest.com/blog/?feed=rss2&amp;p=1</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

