Unicode Support in Antlr

Posted by gougou on Wednesday, 2005-08-24 14:11

Antlr


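Antlr has offered solid Unicode support since version 2.7.2. First, set the charVocabulary option to declare the Unicode range the lexer accepts. Note that antlr does not support \uFFFF: as a 16-bit value it equals -1, which antlr uses to mark the end of input. Then, when writing rules, give the token a Unicode range, for example '\u0080'..'\ufffe'. An example follows. Note also that Antlr seems able to parse only Unicode-encoded files; files in encodings such as GB2312 are not supported, so they must be converted to Unicode before they can be parsed.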

header {
#include "SqliteEntity.h"
#include <antlr/TokenStreamRewriteEngine.hpp>
#include "UnicodeCharBuffer.hpp"
#include "UnicodeCharScanner.hpp"

ANTLR_USING_NAMESPACE(std)
ANTLR_USING_NAMESPACE(antlr)
}

class L extends Lexer("UnicodeCharScanner");

options
{
    // Allow any char but \uFFFF (16 bit -1)
    // hmmm antlr does not allow \u10FFFE
    charVocabulary='\u0000'..'\uFFFE';
    noConstructors = true;
}

{
public:
    bool done;

    // Pass false for the caseSensitive argument.
    L( std::istream& in )
    : UnicodeCharScanner(new UnicodeCharBuffer(in),false), done(false)
    {
        initLiterals();
    }
    L( UnicodeCharBuffer& ib )
    : UnicodeCharScanner(ib,false), done(false)
    {
        initLiterals();
    }

    // Called when the scanner reaches the end of the input.
    void uponEOF()
    {
        done = true;
    }
}

WORD : ( ~(' '|'\n'|'\r'|'\t') )+
    ;

WS :
    (' '
    |'\n' { newline(); } ('\r')?
    |'\r' { newline(); }
    |'\t' { tab(); }
    )
    {$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);}
    ;

protected
ID_START_LETTER
    : '$'
    | '_'
    | 'a'..'z'
    | '\u0080'..'\ufffe'
    ;

protected
ID_LETTER
    : ID_START_LETTER
    | '0'..'9'
    ;
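
To make the two constraints above concrete (the vocabulary stops at \uFFFE, and the input has to be Unicode), here is a minimal standalone C++ sketch. It assumes the input is UTF-8, which is my guess at what "Unicode file" means for this scanner. The names decodeUtf8, inVocabulary and collidesWithEof are hypothetical helpers written for illustration and are not part of antlr or of UnicodeCharBuffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not antlr code): decode UTF-8 bytes into code points.
// Malformed or truncated sequences yield U+FFFD. Overlong forms are not
// rejected, so this is a sketch rather than a validating decoder.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Every continuation byte must look like 10xxxxxx.
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : 0xFFFD);
    }
    return out;
}

// True if cp lies inside charVocabulary '\u0000'..'\uFFFE'.
bool inVocabulary(std::uint32_t cp)
{
    return cp <= 0xFFFE;
}

// True if cp is the 16-bit pattern of -1, the value reserved for end of input.
bool collidesWithEof(std::uint32_t cp)
{
    return cp == static_cast<std::uint16_t>(-1);  // -1 truncated to 16 bits is 0xFFFF
}
```

Under these assumptions, the UTF-8 bytes E4 B8 AD decode to U+4E2D, which is inside the vocabulary, while the same character in GB2312 (bytes D6 D0) does not decode as UTF-8 at all. That is consistent with the need to convert such files first.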

Note that testLiteralsTable in UnicodeCharScanner.hpp has a bug: it does not take case sensitivity into account. Below is my corrected version.

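// Requires <cctype> for tolower.
virtual int testLiteralsTable(int ttype) const
{
    string_map::const_iterator i;
    if (caseSensitive)
        i = literals.find(text);
    else
    {
        // Look up a lower-cased copy so that literals match regardless of case.
        // The unsigned char cast keeps tolower well defined for bytes >= 0x80.
        string lowText = text;
        for (string::size_type k = 0; k < lowText.size(); ++k)
            lowText[k] = static_cast<char>(tolower(static_cast<unsigned char>(lowText[k])));
        i = literals.find(lowText);
    }
    if (i != literals.end())
        ttype = (*i).second;
    return ttype;
}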
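
The effect of the fix can be tried outside the scanner with a small standalone sketch. LiteralMap and lookupLiteral are hypothetical stand-ins for the scanner's literals table and testLiteralsTable, and the sketch assumes the literals are stored in lower case.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for the scanner's literals table (text -> token type).
typedef std::map<std::string, int> LiteralMap;

// Return the token type registered for text, or ttype if text is not a literal.
// When caseSensitive is false, the lookup uses a lower-cased copy of text,
// which assumes the table itself holds lower-case keys.
int lookupLiteral(const LiteralMap& literals, const std::string& text,
                  bool caseSensitive, int ttype)
{
    std::string key = text;
    if (!caseSensitive) {
        for (std::string::size_type k = 0; k < key.size(); ++k)
            key[k] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(key[k])));
    }
    LiteralMap::const_iterator it = literals.find(key);
    return it != literals.end() ? it->second : ttype;
}
```

With a table that maps "select" to some token type, looking up "SELECT" finds that type only when caseSensitive is false. With caseSensitive set to true, the fallback ttype comes back unchanged, which is the behavior the unpatched version showed in both modes.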

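Finally, be aware that once Unicode parsing is enabled, antlr becomes extremely slow at generating the parser, so I suggest finishing the rest of the grammar first and adding the Unicode handling last.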
