Apache Solr - DataImportHandler Release Notes Introduction ------------ DataImportHandler is a data import tool for Solr which makes importing data from Databases, XML files and HTTP data sources quick and easy. $Id$ ================== 3.1.0 ================== Upgrading from Solr 1.4 ---------------------- Versions of Major Components --------------------- Detailed Change List ---------------------- New Features ---------------------- * SOLR-1525 : allow DIH to refer to core properties (noble) * SOLR-1547 : TemplateTransformer copy objects more intelligently when there when the template is a single variable (noble) * SOLR-1627 : VariableResolver should be fetched just in time (noble) * SOLR-1583 : Create DataSources that return InputStream (noble) * SOLR-1358 : Integration of Tika and DataImportHandler ( Akshay Ukey, noble) * SOLR-1654 : TikaEntityProcessor example added DIHExample (Akshay Ukey via noble) * SOLR-1678 : Move onError handling to DIH framework (noble) * SOLR-1352 : Multi-threaded implementation of DIH (noble) * SOLR-1721 : Add explicit option to run DataImportHandler in synchronous mode (Alexey Serba via noble) * SOLR-1737 : Added FieldStreamDataSource (noble) Optimizations ---------------------- * SOLR-2200: Improve the performance of DataImportHandler for large delta-import updates. (Mark Waddle via rmuir) Bug Fixes ---------------------- * SOLR-1638: Fixed NullPointerException during import if uniqueKey is not specified in schema (Akshay Ukey via shalin) * SOLR-1639: Fixed misleading error message when dataimport.properties is not writable (shalin) * SOLR-1598: Reader used in PlainTextEntityProcessor is not explicitly closed (Sascha Szott via noble) * SOLR-1759: $skipDoc was not working correctly (Gian Marco Tagliani via noble) * SOLR-1762: DateFormatTransformer does not work correctly with non-default locale dates (tommy chheng via noble) * SOLR-1757: DIH multithreading sometimes throws NPE (noble) * SOLR-1766: DIH with threads enabled doesn't respond to the abort command (Michael Henson via noble) * SOLR-1767: dataimporter.functions.escapeSql() does not escape backslash character (Sean Timm via noble) * SOLR-1811: formatDate should use the current NOW value always (Sean Timm via noble) * SOLR-1794: Dataimport of CLOB fields fails when getCharacterStream() is defined in a superclass. (Gunnar Gauslaa Bergem via rmuir) * SOLR-2057: DataImportHandler never calls UpdateRequestProcessor.finish() (Drew Farris via koji) * SOLR-1973: Empty fields in XML update messages confuse DataImportHandler. (koji) * SOLR-2221: Use StrUtils.parseBool() to get values of boolean options in DIH. true/on/yes (for TRUE) and false/off/no (for FALSE) can be used for sub-options (debug, verbose, synchronous, commit, clean, optimize) for full/delta-import commands. (koji) * SOLR-2310: getTimeElapsedSince() returns incorrect hour value when the elapse is over 60 hours (tom liu via koji) * SOLR-2252: When a child entity in nested entities is rootEntity="true", delta-import doesn't work. (koji) * SOLR-1191: resolve DataImportHandler deltaQuery column against pk when pk has a prefix (e.g. pk="book.id" deltaQuery="select id from ..."). More useful error reporting when no match found (previously failed with a NullPointerException in log and no clear user feedback). (gthb via yonik) * SOLR-2116: Fix TikaConfig classloader bug in TikaEntityProcessor (Martijn van Groningen via hossman) Other Changes ---------------------- * SOLR-1821: Fix TimeZone-dependent test failure in TestEvaluatorBag. (Chris Male via rmuir) * SOLR-2367: Reduced noise in test output by ensuring the properties file can be written. (Gunnlaugur Thor Briem via rmuir) Build ---------------------- Documentation ---------------------- ================== Release 1.4.0 ================== Upgrading from Solr 1.3 ----------------------- Evaluator API has been changed in a non back-compatible way. Users who have developed custom Evaluators will need to change their code according to the new API for it to work. See SOLR-996 for details. The formatDate evaluator's syntax has been changed. The new syntax is formatDate(, ''). For example, formatDate(x.date, 'yyyy-MM-dd'). In the old syntax, the date string was written without a single-quotes. The old syntax has been deprecated and will be removed in 1.5, until then, using the old syntax will log a warning. The Context API has been changed in a non back-compatible way. In particular, the Context.currentProcess() method now returns a String describing the type of the current import process instead of an int. Similarily, the public constants in Context viz. FULL_DUMP, DELTA_DUMP and FIND_DELTA are changed to a String type. See SOLR-969 for details. The EntityProcessor API has been simplified by moving logic for applying transformers and handling multi-row outputs from Transformers into an EntityProcessorWrapper class. The EntityProcessor#destroy is now called once per parent-row at the end of row (end of data). A new method EntityProcessor#close is added which is called at the end of import. In Solr 1.3, if the last_index_time was not available (first import) and a delta-import was requested, a full-import was run instead. This is no longer the case. In Solr 1.4 delta import is run with last_index_time as the epoch date (January 1, 1970, 00:00:00 GMT) if last_index_time is not available. Detailed Change List ---------------------- New Features ---------------------- 1. SOLR-768: Set last_index_time variable in full-import command. (Wojtek Piaseczny, Noble Paul via shalin) 2. SOLR-811: Allow a "deltaImportQuery" attribute in SqlEntityProcessor which is used for delta imports instead of DataImportHandler manipulating the SQL itself. (Noble Paul via shalin) 3. SOLR-842: Better error handling in DataImportHandler with options to abort, skip and continue imports. (Noble Paul, shalin) 4. SOLR-833: A DataSource to read data from a field as a reader. This can be used, for example, to read XMLs residing as CLOBs or BLOBs in databases. (Noble Paul via shalin) 5. SOLR-887: A Transformer to strip HTML tags. (Ahmed Hammad via shalin) 6. SOLR-886: DataImportHandler should rollback when an import fails or it is aborted (shalin) 7. SOLR-891: A Transformer to read strings from Clob type. (Noble Paul via shalin) 8. SOLR-812: Configurable JDBC settings in JdbcDataSource including optimized defaults for read only mode. (David Smiley, Glen Newton, shalin) 9. SOLR-910: Add a few utility commands to the DIH admin page such as full import, delta import, status, reload config. (Ahmed Hammad via shalin) 10.SOLR-938: Add event listener API for import start and end. (Kay Kay, Noble Paul via shalin) 11.SOLR-801: Add support for configurable pre-import and post-import delete query per root-entity. (Noble Paul via shalin) 12.SOLR-988: Add a new scope for session data stored in Context to store objects across imports. (Noble Paul via shalin) 13.SOLR-980: A PlainTextEntityProcessor which can read from any DataSource and output a String. (Nathan Adams, Noble Paul via shalin) 14.SOLR-1003: XPathEntityprocessor must allow slurping all text from a given xml node and its children. (Noble Paul via shalin) 15.SOLR-1001: Allow variables in various attributes of RegexTransformer, HTMLStripTransformer and NumberFormatTransformer. (Fergus McMenemie, Noble Paul, shalin) 16.SOLR-989: Expose running statistics from the Context API. (Noble Paul, shalin) 17.SOLR-996: Expose Context to Evaluators. (Noble Paul, shalin) 18.SOLR-783: Enhance delta-imports by maintaining separate last_index_time for each entity. (Jon Baer, Noble Paul via shalin) 19.SOLR-1033: Current entity's namespace is made available to all Transformers. This allows one to use an output field of TemplateTransformer in other transformers, among other things. (Fergus McMenemie, Noble Paul via shalin) 20.SOLR-1066: New methods in Context to expose Script details. ScriptTransformer changed to read scripts through the new API methods. (Noble Paul via shalin) 21.SOLR-1062: A LogTransformer which can log data in a given template format. (Jon Baer, Noble Paul via shalin) 22.SOLR-1065: A ContentStreamDataSource which can accept HTTP POST data in a content stream. This can be used to push data to Solr instead of just pulling it from DB/Files/URLs. (Noble Paul via shalin) 23.SOLR-1061: Improve RegexTransformer to create multiple columns from regex groups. (Noble Paul via shalin) 24.SOLR-1059: Special flags introduced for deleting documents by query or id, skipping rows and stopping further transforms. Use $deleteDocById, $deleteDocByQuery for deleting by id and query respectively. Use $skipRow to skip the current row but continue with the document. Use $stopTransform to stop further transformers. New methods are introduced in Context for deleting by id and query. (Noble Paul, Fergus McMenemie, shalin) 25.SOLR-1076: JdbcDataSource should resolve variables in all its configuration parameters. (shalin) 26.SOLR-1055: Make DIH JdbcDataSource easily extensible by making the createConnectionFactory method protected and return a Callable object. (Noble Paul, shalin) 27.SOLR-1058: JdbcDataSource can lookup javax.sql.DataSource using JNDI. Use a jndiName attribute to specify the location of the data source. (Jason Shepherd, Noble Paul via shalin) 28.SOLR-1083: An Evaluator for escaping query characters. (Noble Paul, shalin) 29.SOLR-934: A MailEntityProcessor to enable indexing mails from POP/IMAP sources into a solr index. (Preetam Rao, shalin) 30.SOLR-1060: A LineEntityProcessor which can stream lines of text from a given file to be indexed directly or for processing with transformers and child entities. (Fergus McMenemie, Noble Paul, shalin) 31.SOLR-1127: Add support for field name to be templatized. (Noble Paul, shalin) 32.SOLR-1092: Added a new command named 'import' which does not automatically clean the index. This is useful and more appropriate when one needs to import only some of the entities. (Noble Paul via shalin) 33.SOLR-1153: 'deltaImportQuery' is honored on child entities as well (noble) 34.SOLR-1230: Enhanced dataimport.jsp to work with all DataImportHandler request handler configurations, rather than just a hardcoded /dataimport handler. (ehatcher) 35.SOLR-1235: disallow period (.) in entity names (noble) 36.SOLR-1234: Multiple DIH does not work because all of them write to dataimport.properties. Use the handler name as the properties file name (noble) 37.SOLR-1348: Support binary field type in convertType logic in JdbcDataSource (shalin) 38.SOLR-1406: Make FileDataSource and FileListEntityProcessor to be more extensible (Luke Forehand, shalin) 39.SOLR-1437 : XPathEntityProcessor can deal with xpath syntaxes such as //tagname , /root//tagname (Fergus McMenemie via noble) Optimizations ---------------------- 1. SOLR-846: Reduce memory consumption during delta import by removing keys when used (Ricky Leung, Noble Paul via shalin) 2. SOLR-974: DataImportHandler skips commit if no data has been updated. (Wojtek Piaseczny, shalin) 3. SOLR-1004: Check for abort more frequently during delta-imports. (Marc Sturlese, shalin) 4. SOLR-1098: DateFormatTransformer can cache the format objects. (Noble Paul via shalin) 5. SOLR-1465: Replaced string concatenations with StringBuilder append calls in XPathRecordReader. (Mark Miller, shalin) Bug Fixes ---------------------- 1. SOLR-800: Deep copy collections to avoid ConcurrentModificationException in XPathEntityprocessor while streaming (Kyle Morrison, Noble Paul via shalin) 2. SOLR-823: Request parameter variables ${dataimporter.request.xxx} are not resolved (Mck SembWever, Noble Paul, shalin) 3. SOLR-728: Add synchronization to avoid race condition of multiple imports working concurrently (Walter Ferrara, shalin) 4. SOLR-742: Add ability to create dynamic fields with custom DataImportHandler transformers (Wojtek Piaseczny, Noble Paul, shalin) 5. SOLR-832: Rows parameter is not honored in non-debug mode and can abort a running import in debug mode. (Akshay Ukey, shalin) 6. SOLR-838: The VariableResolver obtained from a DataSource's context does not have current data. (Noble Paul via shalin) 7. SOLR-864: DataImportHandler does not catch and log Errors (shalin) 8. SOLR-873: Fix case-sensitive field names and columns (Jon Baer, shalin) 9. SOLR-893: Unable to delete documents via SQL and deletedPkQuery with deltaimport (Dan Rosher via shalin) 10. SOLR-888: DateFormatTransformer cannot convert non-string type (Amit Nithian via shalin) 11. SOLR-841: DataImportHandler should throw exception if a field does not have column attribute (Michael Henson, shalin) 12. SOLR-884: CachedSqlEntityProcessor should check if the cache key is present in the query results (Noble Paul via shalin) 13. SOLR-985: Fix thread-safety issue with TemplateString for concurrent imports with multiple cores. (Ryuuichi Kumai via shalin) 14. SOLR-999: XPathRecordReader fails on XMLs with nodes mixed with CDATA content. (Fergus McMenemie, Noble Paul via shalin) 15.SOLR-1000: FileListEntityProcessor should not apply fileName filter to directory names. (Fergus McMenemie via shalin) 16.SOLR-1009: Repeated column names result in duplicate values. (Fergus McMenemie, Noble Paul via shalin) 17.SOLR-1017: Fix thread-safety issue with last_index_time for concurrent imports in multiple cores due to unsafe usage of SimpleDateFormat by multiple threads. (Ryuuichi Kumai via shalin) 18.SOLR-1024: Calling abort on DataImportHandler import commits data instead of calling rollback. (shalin) 19.SOLR-1037: DIH should not add null values in a row returned by EntityProcessor to documents. (shalin) 20.SOLR-1040: XPathEntityProcessor fails with an xpath like /feed/entry/link[@type='text/html']/@href (Noble Paul via shalin) 21.SOLR-1042: Fix memory leak in DIH by making TemplateString non-static member in VariableResolverImpl (Ryuuichi Kumai via shalin) 22.SOLR-1053: IndexOutOfBoundsException in SolrWriter.getResourceAsString when size of data-config.xml is a multiple of 1024 bytes. (Herb Jiang via shalin) 23.SOLR-1077: IndexOutOfBoundsException with useSolrAddSchema in XPathEntityProcessor. (Sam Keen, Noble Paul via shalin) 24.SOLR-1080: RegexTransformer should not replace if regex is not matched. (Noble Paul, Fergus McMenemie via shalin) 25.SOLR-1090: DataImportHandler should load the data-config.xml using UTF-8 encoding. (Rui Pereira, shalin) 26.SOLR-1146: ConcurrentModificationException in DataImporter.getStatusMessages (Walter Ferrara, Noble Paul via shalin) 27.SOLR-1229: Fixes for deletedPkQuery, particularly when using transformed Solr unique id's (Lance Norskog, Noble Paul via ehatcher) 28.SOLR-1286: Fix the commit parameter always defaulting to "true" even if "false" is explicitly passed in. (Jay Hill, Noble Paul via ehatcher) 29.SOLR-1323: Reset XPathEntityProcessor's $hasMore/$nextUrl when fetching next URL (noble, ehatcher) 30.SOLR-1450: Jdbc connection properties such as batchSize are not applied if the driver jar is placed in solr_home/lib. (Steve Sun via shalin) 31.SOLR-1474: Delta-import should run even if last_index_time is not set. (shalin) Documentation ---------------------- 1. SOLR-1369: Add HSQLDB Jar to example-DIH, unzip database and update instructions. Other ---------------------- 1. SOLR-782: Refactored SolrWriter to make it a concrete class and removed wrappers over SolrInputDocument. Refactored to load Evaluators lazily. Removed multiple document nodes in the configuration xml. Removed support for 'default' variables, they are automatically available as request parameters. (Noble Paul via shalin) 2. SOLR-964: XPathEntityProcessor now ignores DTD validations (Fergus McMenemie, Noble Paul via shalin) 3. SOLR-1029: Standardize Evaluator parameter parsing and added helper functions for parsing all evaluator parameters in a standard way. (Noble Paul, shalin) 4. SOLR-1081: Change EventListener to be an interface so that components such as an EntityProcessor or a Transformer can act as an event listener. (Noble Paul, shalin) 5. SOLR-1027: Alias the 'dataimporter' namespace to a shorter name 'dih'. (Noble Paul via shalin) 6. SOLR-1084: Better error reporting when entity name is a reserved word and data-config.xml root node is not . (Noble Paul via shalin) 7. SOLR-1087: Deprecate 'where' attribute in CachedSqlEntityProcessor in favor of cacheKey and cacheLookup. (Noble Paul via shalin) 8. SOLR-969: Change the FULL_DUMP, DELTA_DUMP, FIND_DELTA constants in Context to String. Change Context.currentProcess() to return a string instead of an integer. (Kay Kay, Noble Paul, shalin) 9. SOLR-1120: Simplified EntityProcessor API by moving logic for applying transformers and handling multi-row outputs from Transformers into an EntityProcessorWrapper class. The behavior of the method EntityProcessor#destroy has been modified to be called once per parent-row at the end of row. A new method EntityProcessor#close is added which is called at the end of import. A new method Context#getResolvedEntityAttribute is added which returns the resolved value of an entity's attribute. Introduced a DocWrapper which takes care of maintaining document level session variables. (Noble Paul, shalin) 10.SOLR-1265: Add variable resolving for URLDataSource properties like baseUrl. (Chris Eldredge via ehatcher) 11.SOLR-1269: Better error messages from JdbcDataSource when JDBC Driver name or SQL is incorrect. (ehatcher, shalin) ================== Release 1.3.0 20080915 ================== Status ------ This is the first release since DataImportHandler was added to the contrib solr distribution. The following changes list changes since the code was introduced, not since the first official release. Detailed Change List -------------------- New Features 1. SOLR-700: Allow configurable locales through a locale attribute in fields for NumberFormatTransformer. (Stefan Oestreicher, shalin) Changes in runtime behavior Bug Fixes 1. SOLR-704: NumberFormatTransformer can silently ignore part of the string while parsing. Now it tries to use the complete string for parsing. Failure to do so will result in an exception. (Stefan Oestreicher via shalin) 2. SOLR-729: Context.getDataSource(String) gives current entity's DataSource instance regardless of argument. (Noble Paul, shalin) 3. SOLR-726: Jdbc Drivers and DataSources fail to load if placed in multicore sharedLib or core's lib directory. (Walter Ferrara, Noble Paul, shalin) Other Changes