Discussion:
[daisy] Full-Text Indexer: reindex all documents runs out of memory
Geert Coelmont
2013-12-10 18:22:29 UTC
Hi all,
After a massive cleanup of old documents, we want to do a complete
reindex of all remaining documents in our Daisy repository.
For this, we used the FullTextIndexUpdater bean in the JMX management
console and invoked its reIndexAllDocuments operation.
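For reference, that console operation is just a plain JMX call; a minimal
sketch of the same invocation from Java (the connector URL and the object
name below are guesses on our part, so check them against your own JMX
console) looks like this:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReindexAll {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector URL; adapt the host and port to your setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9264/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name is a guess; copy the real one from the console tree.
            ObjectName updater = new ObjectName("Daisy:name=FullTextIndexUpdater");
            // reIndexAllDocuments takes no arguments.
            mbs.invoke(updater, "reIndexAllDocuments", new Object[0], new String[0]);
        } finally {
            connector.close();
        }
    }
}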
This starts fine, and for several minutes the "Reindex Status" reads
"Querying the repository to retrieve the list of documents to re-index
(started at ....)".
After a while, we get the following in the logs:

[ERROR ] <2013-12-10 11:59:06,886> (org.outerj.daisy.ftindex.FullTextIndexImpl): Error updating fulltext index writer.
java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.store.IndexInput.readString(IndexInput.java:92)
        at org.apache.lucene.index.FieldsReader.addFieldForMerge(FieldsReader.java:216)
        at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:124)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:333)
        at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:207)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:97)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:1811)
        at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1742)
        at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:874)
        at org.outerj.daisy.ftindex.FullTextIndexImpl.closeIndexWriter(FullTextIndexImpl.java:330)
        at org.outerj.daisy.ftindex.FullTextIndexImpl.updateWriter(FullTextIndexImpl.java:317)
        at org.outerj.daisy.ftindex.FullTextIndexImpl.access$300(FullTextIndexImpl.java:54)
        at org.outerj.daisy.ftindex.FullTextIndexImpl$IndexFlusher.run(FullTextIndexImpl.java:370)
        at java.lang.Thread.run(Thread.java:662)

We do have a lot of documents in the repository (about 400,000).
It would seem that the actual reindexing hadn't started yet, and that Daisy
was still "collecting" the necessary documents when it ran out of memory.
Is there anything, apart from increasing memory, that we can do to make
this work more efficiently?
It would already be a help if we could reindex only the most recent
documents (e.g. the last few thousand).
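(By "increasing memory" we mean raising the repository server's maximum
heap, e.g. passing something like -Xmx2048m to its JVM; where that flag is
set depends on how the server is started, so take it as an illustration
rather than the exact knob.)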
Thanks in advance,
--
Best regards / Met vriendelijke groeten

*Geert Coelmont*
Headbird

Headbird NV -- ICT services | Sneeuwbeslaan 14 -- 2610 Antwerpen (BE)
+32 3 829 9047 | ***@headbird.com
Paul Focke
2013-12-10 18:35:09 UTC
Geert

Have a look here:
http://docs.ngdata.com/daisy-docs-current/373-daisy/18-daisy.html#dsy18-daisy_rebuild_index

There is a JMX method, reIndexDocuments, that allows you to pass a query as
an argument, so you can restrict the reindex to just the documents that
match.
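As a rough, untested sketch (the connector URL, the object name, and the
query condition are all guesses you will need to adapt; check the names in
your JMX console and the query syntax in the Daisy documentation), calling
it from Java would look something like this:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReindexRecent {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector URL; adapt the host and port to your setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9264/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Object name is a guess; copy the real one from the console tree.
            ObjectName updater = new ObjectName("Daisy:name=FullTextIndexUpdater");
            // Example query restricting the reindex to recently modified
            // documents; the condition is illustrative, not tested.
            String query = "select id where lastModified >= '2013-11-01'";
            mbs.invoke(updater, "reIndexDocuments",
                    new Object[] { query },
                    new String[] { "java.lang.String" });
        } finally {
            connector.close();
        }
    }
}

Restricting the query should also keep the document list that gets
collected up front much smaller than the full 400K.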

Paul


Geert Coelmont
2013-12-10 19:11:08 UTC
Perfect, thanks!