Source for file Hyphenator.php
* $Id: Hyphenator.php 1114 2009-07-10 08:48:44Z heiglandreas $
* Copyright (c) 2008-2009 Andreas Heigl<andreas@heigl.org>
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
* @package Org_Heigl_Hyphenator
* @author Andreas Heigl <andreas@heigl.org>
* @copyright 2008 Andreas Heigl<andreas@heigl.org>
* @license http://www.opensource.org/licenses/mit-license.php MIT-License
* @version SVN: $Revision: 1114 $
* This class implements word-hyphenation
* Word-hyphenation is implemented on the basis of the algorithms developed by
* Franklin Mark Liang for TeX as described in his dissertation at the Department
* of Computer Science at Stanford University.
* This package is based on an idea of Mathias Nater<mnater@mac.com> who
* implemented this word-hyphenation algorithm for JavaScript.
* Hyphenating means, in this case, that all possible hyphenations in a word are
* marked using the soft-hyphen character (character code 173) or any other
* character set via the setHyphen() method.
* A complete text will first be divided into words via a regular expression
* that takes all characters that the \w-Special-Character specifies as well as
* the '@'-Character and possible other - language-specific - characters that
* can be set via the setSpecialChars() method.
* Hyphenation is done using a set of files taken from a current TeX-Distribution
* that are matched using the method getTexFile().
* So here is an example for the usage of the class:
* $hyphenator = Org_Heigl_Hyphenator::getInstance ( 'de' );
* $hyphenator -> setHyphen ( '-' )
* // Minimum 5 characters before the first hyphenation
* // Hyphenate only words with more than 4 characters
* // Set some special characters
* -> setSpecialChars ( 'äöüß' )
* // Only Hyphenate with the best quality
* -> setQuality ( Org_Heigl_Hyphenator::QUALITY_BEST )
* // Words that shall not be hyphenated have to start with this string
* -> setNoHyphenateMarker ( 'nbr:' )
* // Words that contain this string are custom hyphenated
* -> setCustomHyphen ( '--' );
* // Hyphenate the string $text
* $hyphenated = $hyphenator -> hyphenate ( $text );
* @package Org_Heigl_Hyphenator
* @author Andreas Heigl <a.heigl@wdv.de>
* @copyright 2008-2010 Andreas Heigl
* @license http://www.opensource.org/licenses/mit-license.php MIT-License
* @version SVN: $Revision: 1114 $
* @see http://code.google.com/p/hyphenator
* @see http://www.tug.org/docs/liang/liang-thesis.pdf
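* The pattern matching described above can be sketched roughly as follows. This
* is a minimal illustration, not the class's actual implementation: the function
* name liangPositions() and the small pattern table are assumptions made for
* this example (the patterns are the ones commonly quoted for the word
* "hyphenation"), and the code assumes PHP 7 or later and single-byte characters.

```php
<?php
// Hypothetical sketch of Liang-style pattern matching.
// Each pattern maps its letters to a digit string with one digit per gap
// (before the first letter, between letters, after the last letter).
function liangPositions(string $word, array $patterns): array
{
    // Mark the word boundaries with '_' so boundary patterns could match.
    $padded = '_' . strtolower($word) . '_';
    $length = strlen($padded);

    // One value per gap of the padded word, initialised to 0.
    $values = array_fill(0, $length + 1, 0);

    for ($start = 0; $start < $length; $start++) {
        for ($len = 1; $start + $len <= $length; $len++) {
            $part = substr($padded, $start, $len);
            if (!isset($patterns[$part])) {
                continue;
            }
            $digits = $patterns[$part];
            // Keep the highest digit seen for every gap the pattern covers.
            for ($p = 0; $p < strlen($digits); $p++) {
                $values[$start + $p] = max($values[$start + $p], (int) $digits[$p]);
            }
        }
    }

    // Drop the gaps that belong to the '_' markers, so that index $i is the
    // gap in front of character $i of the original word.
    return array_slice($values, 1, strlen($word) + 1);
}

// Patterns hy3ph, he2n, hena4, hen5at, 1na, n2at, 1tio and 2io.
$patterns = array(
    'hyph'  => '00300',
    'hen'   => '0020',
    'hena'  => '00004',
    'henat' => '000500',
    'na'    => '100',
    'nat'   => '0200',
    'tio'   => '1000',
    'io'    => '200',
);

$positions = liangPositions('hyphenation', $patterns);
```

* In this sketch the gaps in front of the third and the seventh character end up
* with odd values, which corresponds to the break points hy-phen-ation.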
const QUALITY_BETTER = 3;
const QUALITY_NORMAL = 5;
const QUALITY_POREST = 9;
* This is the default language to use.
* @var string $_defaultLanguage
private static $_defaultLanguage = 'en';
* This property stores an instance of the hyphenator for each language
private static $_store = array ();
* Store the caching-Object
* @var Zend_Cache $_cache
private static $_cache = null;
* Store whether caching is enabled or not
* Caching is turned off by default
* @var boolean $_cachingEnabled
private $_cachingEnabled = false;
* The String that marks a word not to hyphenate
* @var string $_noHyphenateString
private $_noHyphenateString = null;
* This property defines the default hyphenation-character.
* This is set during instantiation to the Soft-Hyphen-Character (character code 173)
* but can be overwritten using the setHyphen()-Method
* This property defines how many characters need to stay to the left side
* This defaults to 2 characters, but it can be overwritten using the
* This property defines how many characters need to stay to the right side
* This defaults to 2 characters, but it can be overwritten using the
* Whether to mark Customized Hyphenations or not.
* @var boolean $_markCustomized
private $_markCustomized = false;
* When customizations shall be used, what string shall be prepended to the
* word that contains customizations.
* @var string|null $_customizedMarker
private $_customizedMarker = '<!--cm-->';
* The shortest pattern length to use for hyphenating
* @var int $_shortestPattern
private $_shortestPattern = 2;
* The longest pattern length to use for hyphenating.
* With a high number (like 10), almost every pattern should be used
* @var int $_longestPattern
private $_longestPattern = 10;
* This property defines some special characters for a language that need
* to be taken into account for the definition of a word.
* @var string $_specialChars
private $_specialChars = '';
* This property defines how long a word needs to be before it can be hyphenated.
* This defaults to 6 characters, but it can be overridden using
* This property contains the pattern-array for a specific language
* @var array|null $_pattern
private $_pattern = null;
* The currently set quality for hyphenation
* The lower the number, the better the hyphenation is
* The String that shall be searched for as a customHyphen
* @var string $_customHyphen
private $_customHyphen = '--';
* This is the static way of hyphenating a string.
* This method gets the appropriate Hyphenator-object and calls the method
* @param string $string The String to hyphenate
* @param array $options The Options to use for Hyphenation
* @return string The hyphenated string
public static function parse ( $string, $options = null ) {
if ( null === $options ) {
if ( ! isset ( $options [ 'language' ] ) ) {
// Get the instance for the language.
unset ( $options['language'] );
foreach ( $options as $key => $val ) {
// Hyphenate the string using the Hyphenator instance.
$string = $hyphenator -> hyphenate ( $string );
// Return the hyphenated string.
* Set the default Language
* @param string $language The language to set.
* Get the default language
return Org_Heigl_Hyphenator::$_defaultLanguage;
* This method gets the hyphenator-instance for the language <var>$language</var>
* If no instance exists, it is created and stored.
* @param string $language The language to use for hyphenating
* @return Org_Heigl_Hyphenator A Hyphenator-Object
* @throws InvalidArgumentException
public static function getInstance ( $language = 'en' ) {
$file = dirname ( __FILE__ )
throw new InvalidArgumentException( 'file ' . $language . '.php does not exist' );
( ! Org_Heigl_Hyphenator::$_store[$language] instanceof Org_Heigl_Hyphenator ) ) {
* This method parses a TEX-Hyphenation file and creates the appropriate
* @param string $file The original TEX-File
* @param string $parsedFile The PHP-File to be created
public static function parseTexFile ( $file, $parsedFile ) {
$fc = file_get_contents ( $file );
if ( ! preg_match ( '/[\\n\\r]\\\\patterns\\{(.*)\\}\\s*\\\\/sim', $fc, $array ) ) {
$fc = preg_replace ( array('/"a/', '/"o/', '/"u/', '/\\./' ), array ( 'ä', 'ö', 'ü', '_' ), $fc );
$fh = fopen ( $parsedFile, 'w+' );
* Copyright (c) 2008-2010 Andreas Heigl<andreas@heigl.org>
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
* This file has been automatically created from the file ' . basename ( $file ) . '
* via the method Org_Heigl_Hyphenator::parseTexFile().
* DO NOT EDIT THIS FILE UNLESS YOU KNOW WHAT YOU ARE DOING!
* @package Org_Heigl_Hyphenator
* @subpackage HyphenationFiles
* @author Org_Heigl_Hyphenator
* @copyright 2008-2010 Andreas Heigl<andreas@heigl.org>
* @license http://www.opensource.org/licenses/mit-license.php MIT-License
* @since ' . date ( 'd.m.Y' ) . '
foreach ( $array as $pattern ) {
if ( strpos ( $pattern, '\\' ) !== false ) {
$strlen = strlen ( $pattern );
for ( $i = 0; $i < $strlen; $i++ ) {
if ( ( ( $i ) <= $strlen ) && preg_match ( '/[0-9]/', substr ( $pattern, $i, 1 ) ) ) {
$patternint .= substr ( $pattern, $i, 1 );
if ( $patternstring != '' ) {
fwrite ( $fh, '$pattern[\'' . $patternstring . '\'] = \'' . $patternint . '\';' . "\n" );
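* The loop above separates the letters of a TeX pattern from its digits. A
* minimal sketch of that idea is shown below. The helper splitTexPattern() and
* its exact output format (one digit per gap, '0' where the pattern gives none)
* are assumptions made for illustration and need not match what parseTexFile()
* writes; the code assumes PHP 7 or later and single-byte characters.

```php
<?php
// Hypothetical helper, not part of Org_Heigl_Hyphenator.
// Splits a TeX pattern such as 'hy3ph' into its letters ('hyph') and a digit
// string with one digit per gap ('00300').
function splitTexPattern(string $pattern): array
{
    $letters      = '';
    $digits       = '';
    $pendingDigit = '0'; // value of the gap in front of the next letter
    $length       = strlen($pattern);

    for ($i = 0; $i < $length; $i++) {
        $char = $pattern[$i];
        if (ctype_digit($char)) {
            // A digit belongs to the gap before the following letter.
            $pendingDigit = $char;
            continue;
        }
        // The word-boundary dot is stored as '_', as in the preg_replace above.
        $letters     .= ('.' === $char) ? '_' : $char;
        $digits      .= $pendingDigit;
        $pendingDigit = '0';
    }

    // The gap after the last letter.
    $digits .= $pendingDigit;

    return array($letters, $digits);
}

list($letters, $digits) = splitTexPattern('hy3ph');
```

* With this layout the digit string is always one character longer than the
* letter string, which makes it easy to line the digits up with the gaps of a word.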
* This method returns the name of the TeX-Hyphenation file for a language code
* @param string $language The language code to get the file name for
$files = array ( 'ba' => 'bahyph.tex',
'de_OLD' => 'dehypht.tex',
return $files[$language];
* Set an instance of Zend_Cache as Caching-Backend.
* @param Zend_Cache $cache The caching Backend
* @link http://framework.zend.com/zend.cache.html
public static function setCache ( Zend_Cache $cache ) {
* This is the constructor that initialises the hyphenator for the given
* language <var>$language</var>
* This constructor is declared private to ensure that it is only called
* via the getInstance() method, so that initialisation happens only once
* @param string $language The language to use for hyphenating
$lang = array ( $language );
// strpos() expects the haystack first and the needle second.
$pos = strpos ( $language, '_' );
$lang [] = substr ( $language, 0, $pos );
foreach ( $lang as $language ) {
$this -> _language = $language;
include_once $parsedFile;
} catch ( Exception $e ) {
throw new Exception ( 'File \'' . $parsedFile . '\' could not be found' );
$this -> _pattern = $pattern;
if ( null === $this -> _hyphen ) {
$this -> _hyphen = chr ( 173 );
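* The constructor builds a list of language codes to try, from the full code
* down to its part before the underscore. A minimal sketch of that fallback,
* assuming a hypothetical helper languageCandidates() that is not part of the
* class and PHP 7 or later:

```php
<?php
// Hypothetical helper illustrating the language fallback.
function languageCandidates(string $language): array
{
    $candidates = array($language);

    // strpos() takes the haystack first, then the needle, and returns
    // false when the needle does not occur.
    $pos = strpos($language, '_');
    if (false !== $pos) {
        $candidates[] = substr($language, 0, $pos);
    }

    return $candidates;
}

$candidates = languageCandidates('de_DE');
```

* Checking for false explicitly avoids adding an empty fallback entry for
* codes such as 'en' that contain no underscore.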
* This method does the actual hyphenation.
* The given <var>$string</var> is split into chunks (i.e. words) at
* After that every chunk is hyphenated and the array of chunks is merged
* into a single string using blanks again.
* This method does not take into account word delimiters other than blanks
* (e.g. line breaks or tab stops) and it will fail with texts containing markup
* @param string $string The string to hyphenate
* @return string The hyphenated string
$this -> _rawWord = array ();
// If caching is enabled and the string is already cached, return the
if ( false !== $result ) {
$size = count ( $array );
for ( $i = 0; $i < $size; $i++ ) {
$hyphenatedString = implode ( ' ', $array );
// If caching is enabled, write the hyphenated string to the cache.
$this -> cacheWrite ( $string, $hyphenatedString );
// Return the hyphenated string.
return $hyphenatedString;
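* The split, hyphenate and merge steps described for this method can be
* sketched as follows. The function hyphenateChunks() and its callback
* parameter are assumptions made for this example and do not exist in the
* class; the callback stands in for the per-word hyphenation.

```php
<?php
// Hypothetical sketch: split a text at blanks, process every chunk and
// join the chunks with blanks again.
function hyphenateChunks(string $text, callable $hyphenateWord): string
{
    $chunks = explode(' ', $text);
    $size   = count($chunks);

    for ($i = 0; $i < $size; $i++) {
        $chunks[$i] = $hyphenateWord($chunks[$i]);
    }

    return implode(' ', $chunks);
}

// A stand-in callback that only upper-cases each chunk.
$processed = hyphenateChunks('soft  hyphen', function (string $word): string {
    return strtoupper($word);
});
```

* Because only blanks are used as delimiters, repeated blanks survive the round
* trip unchanged, while tabs and line breaks stay inside their chunk.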
* This method hyphenates a single word
* @param string $word The Word to hyphenate
* @return string the hyphenated word
// If the Word is empty, return an empty string.
if ( '' === trim ( $word ) ) {
// Replace a string that marks strings not to be hyphenated with an
// empty string. Also replace all custom hyphenations, as the word shall
// Finally return the word 'as is'.
if ( ( null !== $this -> _noHyphenateString ) && ( 0 === strpos ( $word, $this -> _noHyphenateString ) ) ) {
$string = str_replace ( $this -> _noHyphenateString, '', $word );
$string = str_replace ( $this -> _customHyphen, '', $string );
if ( null !== $this -> _customizedMarker && true === $this -> _markCustomized ) {
// If the length of the word is smaller than the minimum word-size,
if ( $this -> _wordMin > strlen ( $word ) ) {
// Character 173 is the Unicode character 'Soft Hyphen', which may not be
// visible in some editors!
// HTML-Entity for soft hyphenation is &shy;!
if ( false !==
strpos ( $word, '­' ) ) {
return str_replace ( '­', $this -> _hyphen, $word );
// Replace a custom hyphenate-string with the hyphen.
if ( ( null !== $this -> _customHyphen ) && ( false !== strpos ( $word, $this -> _customHyphen ) ) ) {
$string = str_replace ( $this -> _customHyphen, $this -> _hyphen, $word );
if ( null !== $this -> _customizedMarker && true === $this -> _markCustomized ) {
// If the word already contains a hyphen-character, we assume it is
// already hyphenated and return the word 'as is'.
if ( false !== strpos ( $word, $this -> _hyphen ) ) {
$breakPos = strpos ( $word, '-/-' );
if ( false !== strpos ( $word, '-/-' ) ) {
// Word contains '-/-', so put a zero-width space after it and hyphenate
// the parts separated with '-'.
$counter = count ( $parts );
for ( $i = 0; $i < $counter; $i++ ) {
if ( false !== strpos ( $word, '-' ) ) {
// Word contains '-', so put a zero-width space after it and hyphenate
// the parts separated with '-'.
$counter = count ( $parts );
for ( $i = 0; $i < $counter; $i++ ) {
// And Finally the core hyphenation algorithm.
$specials = '\.\:\-\,\;\!\?\/\\\(\)\[\]\{\}\"\'\+\*\#\§\$\%\&\=\@';
// If a special character occurs in the middle of the word, simply
// return the word AS IS.
if ( preg_match ( '/[^' . $specials . '][' . $specials . '][^' . $specials . ']/', $word ) ) {
if ( preg_match ( '/([' . $specials . ']*)([^' . $specials . ']+)([' . $specials . ']*)/', $word, $result ) ) {
for ( $i = 0; $i < $wl; $i++ ) {
for ( $s = 0; $s < $wl - 1; $s++ ) {
for ( $l = $this -> _shortestPattern; $l <= $maxl && $l <= $this -> _longestPattern; $l++ ) {
$part = substr ( $window, 0, $l );
// We found a pattern for this part.
$values = (string) $this -> _pattern [$part];
for ( $p = 0; $p < $m; $p++ ) {
$v = substr ( $values, $p, 1 );
$arrayKey = $i + $p - $corrector;
if ( array_key_exists ( $arrayKey, $positions) && ( ( (int) $v > $positions[$arrayKey] ) ) && ( (int) $v <= $this -> _quality ) ) {
$positions[$arrayKey] = (int) $v;
for ( $i = 1; $i < $wl; $i++ ) {
// If the integer at position $i is greater than 0 and odd,
// we can hyphenate at that position if the integer is lower than or
// equal to the set quality-level.
// Additionally we check whether the left and right margins are met.
if ( ( 0 !== $positions[$i] ) && ( 1 === ( $positions[$i] % 2 ) ) &&
// FIXME: This prohibits Hyphenation-Quality
// ( $positions[$i] <= $this -> _quality ) &&
( $i >= $this -> _leftMin ) && ( $i <= ( strlen ( $word ) - $this -> _rightMin ) ) ) {
$sylable = substr ( $word, $lastOne, $i - $lastOne );
$result [] = substr ( $word, $lastOne );
return $prepend . trim ( implode ( $this -> _hyphen, $result ) ) . $append;
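* The final step above, cutting the word at every odd position that respects
* the left and right margins, can be sketched in isolation. The function
* joinAtOddPositions() and its parameters are assumptions made for this
* example; it takes the position values as input instead of computing them and
* assumes PHP 7 or later and single-byte characters.

```php
<?php
// Hypothetical sketch: $positions[$i] is the value of the gap in front of
// character $i of $word. Odd values mark permitted break points.
function joinAtOddPositions(
    string $word,
    array $positions,
    string $hyphen = '-',
    int $leftMin = 2,
    int $rightMin = 2
): string {
    $parts   = array();
    $lastCut = 0;
    $length  = strlen($word);

    for ($i = 1; $i < $length; $i++) {
        $value         = $positions[$i] ?? 0;
        $isOdd         = (1 === $value % 2);
        $insideMargins = ($i >= $leftMin) && ($i <= $length - $rightMin);

        if ($isOdd && $insideMargins) {
            $parts[] = substr($word, $lastCut, $i - $lastCut);
            $lastCut = $i;
        }
    }

    // The remainder after the last break point.
    $parts[] = substr($word, $lastCut);

    return implode($hyphen, $parts);
}

$values     = array(0, 0, 3, 0, 0, 2, 5, 4, 2, 0, 0, 0);
$hyphenated = joinAtOddPositions('hyphenation', $values);
```

* Raising the left margin to 3 in this sketch suppresses the first break point,
* because fewer than three characters would remain in front of it.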
* This method sets the Hyphenation-Character.
* @param string $char The Hyphenation Character
* @return Org_Heigl_Hyphenator Provides fluent Interface
$this -> _hyphen = (string) $char;
* Get the hyphenation character
* This method sets the minimum Characters, that have to stay to the left of
* @param int $count The left minimum
* @return Org_Heigl_Hyphenator Provides fluent Interface
$this -> _leftMin = (int) $count;
* This method sets the minimum Characters, that have to stay to the right of
* @param int $count The minimum characters
* @return Org_Heigl_Hyphenator Provides fluent Interface
$this -> _rightMin = (int) $count;
* This method sets the minimum Characters a word has to have before being
* @param int $count The minimum characters
* @return Org_Heigl_Hyphenator Provides fluent Interface
$this -> _wordMin = (int) $count;
* This method sets the special Characters for a specified language
* @param string $chars The special characters
* @return Org_Heigl_Hyphenator Provides fluent Interface
$this -> _specialChars = $chars;
* Enable or disable caching of hyphenated texts
* @param boolean $caching Whether to enable caching or not. Defaults to
* @return Org_Heigl_Hyphenator
$this -> _cachingEnabled = (bool) $caching;
* Check whether caching is enabled or not
return (bool) $this -> _cachingEnabled;
* Write <var>string</var> to the cache.
* <var>string</var> can be retrieved using <var>key</var>
* @param string $key The key under which the string can be found in the cache
* @param string $string The string to cache
* @return Org_Heigl_Hyphenator
if ( false === $this -> cacheRead ( $key ) ) {
$cache -> save ( $string, $key );
* Get the cached string for a key
* @param string $key The key to return a string for
$result = $cache -> load ( $key );
* Set the minimum quality that the hyphenation needs to have
* The lower the number, the better the quality
* @param int $quality The quality-level to set
* @return Org_Heigl_Hyphenator
$this -> _quality = (int) $quality;
* Set a string that will be replaced with the soft-hyphen before
* hyphenation actually starts.
* If this string is found in a word no hyphenation will be done except for
* the place where the custom hyphen has been found
* @param string $customHyphen The Custom Hyphen to set
* @return Org_Heigl_Hyphenator
$this -> _customHyphen = $customHyphen;
* Set a string that marks a word not to hyphenate
* @param string $marker The marker that marks a word
* @return Org_Heigl_Hyphenator
$this -> _noHyphenateString = $marker;
* Get the marker for custom hyphenations
return (string) $this -> _customHyphen;
* Get the marker for Words not to hyphenate
return (string) $this -> _noHyphenateString;
* Set and retrieve whether or not to mark custom hyphenations
* This method always returns the current setting, so you can set AND
* retrieve the value with this method.
* @param null|boolean $mark Whether or not to mark
$this -> _markCustomized = (bool) $mark;
return (bool) $this -> _markCustomized;
* Set the string that shall be prepended to a customized word.
* @param string $marker The Marker to set
* @return Org_Heigl_Hyphenator
$this -> _customizedMarker = (string) $marker;
* Get the string that shall be prepended to a customized word.
return (string) $this -> _customizedMarker;
Documentation generated on Mon, 07 Jun 2010 12:01:07 +0200 by phpDocumentor 1.4.3