Indexing PDF documents using Sunspot / Solr seemed like a difficult task initially. There are a number of tutorials dealing with this topic, but they seemed overly complicated. In the end, I decided the simplest way to provide search functionality for PDF documents was to:
1) parse the documents locally
2) store the raw PDF text in a text field on an ActiveRecord object
3) index that raw text field and make it searchable
This allows us to skip the unnecessary step of having Solr index the PDF document itself, and instead index the raw PDF text – which is really the only thing we need.
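The configuration is as follows:

    # Gemfile
    gem 'sunspot_rails'
    gem 'pdf-reader'

    # we're using websolr on Heroku, so the bundled Solr is only needed in development
    group :development do
      gem 'sunspot_solr'
    end

    # In the model that holds the PDF text
    searchable do
      text :pdf_contents
    end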
Parsing and automatic indexing.
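    require 'open-uri'

    # If you are downloading the file; for a local file, pass its path to PDF::Reader.new instead.
    # URI.open needs Ruby 2.5+; on older versions use open(url_to_pdf_document).
    file = URI.open(url_to_pdf_document)
    reader = PDF::Reader.new(file)

    # Collapse all newlines and extraneous whitespace in the raw content of each page,
    # then join the pages with a space so words at page boundaries do not run together
    contents = reader.pages.map { |page| page.text.gsub(/\s+/, " ").strip }.join(" ")

    object.pdf_contents = contents
    object.save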
Saving the object triggers Sunspot's automatic indexing, so the PDF text is now indexed and searchable within the Rails app.
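If you want to check the whitespace cleanup from the parsing step on its own, without a PDF or a running Solr instance, a minimal sketch might look like the following. The helper name normalize_pdf_text is hypothetical (it is not part of pdf-reader or Sunspot), and it assumes you have already collected each page's text as a plain string, for example via reader.pages.map(&:text).

```ruby
# Hypothetical helper (not part of pdf-reader or Sunspot): takes the text of
# each page as a string and returns one string suitable for a text field.
def normalize_pdf_text(page_texts)
  page_texts
    .map { |text| text.gsub(/\s+/, " ").strip } # collapse newlines, tabs, and repeated spaces
    .reject(&:empty?)                            # skip pages with no text
    .join(" ")                                   # keep a space between pages
end

# Sample strings standing in for page.text output
pages = ["Annual  Report\n2013 ", "\n\n", "Revenue\tgrew\nsteadily"]
puts normalize_pdf_text(pages)
```

Replacing newlines with a space rather than deleting them keeps the last word of one line from fusing with the first word of the next, which matters for full-text search.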
This post does not cover making the PDFs accessible through your app – that will depend on your specific situation. The steps above are storage agnostic and will work with any configuration (S3, local, another cloud service, etc.).
I ended up doing the same thing. Tried using Yomu (an Apache Tika wrapper), but 50% of my PDFs didn’t parse well. pdf-reader seemed to be fine with any PDF I threw at it. So, I’ll probably use pdf-reader for PDFs and Tika for Word and other docs.
Where are you putting the pdf-reader code? I can’t get this to work.