Awarded: ICDevs.org - Bounty #29 - XML Parser - Motoko - $8,000

XML Parser - Motoko - #29

Current Status: Discussion

  • Discussion (01/10/2023)
  • Ratification: (01/09/2023)
  • Open for application: (01/09/2023)
  • Assigned
  • In Review
  • Closed

Official Link

Bounty Details

  • Bounty Amount: $8,000 USD of ICP at award date
  • ICDevs.org Bounty Acceleration: For each 1 ICP sent to e156fe180f0f6deffa87344390dc45b2e6d4483d4007f6ea8f3f4d89e56fa5d2, ICDevs.org will add .25 ICP to this issue and .75 ICP to fund other ICDevs.org initiatives.
  • Project Type: Team
  • Opened: 01/10/2023
  • Time Commitment: Weeks
  • Project Type: Library
  • Experience Type: Intermediate - Motoko

Description

We have a JSON parser for Motoko, but some legacy web APIs use XML. This bounty asks you to use the JSON parser patterns and update them to parse (decode) XML. You will need to provide Candid types for XML and make sure that you support parsing the XML Spec.

You should also provide an encoder that will take the XML Candid type and output the XML as text. This will be useful for building XML-based web services.

This bounty gives the opportunity to

To apply for this bounty you should:

  • Include links to previous work writing tutorials and any other open-source contributions (i.e. your GitHub).
  • Include a brief overview of how you will complete the task. This can include things like which dependencies you will use, how you will make it self-contained, the sacrifices you would have to make to achieve that, or how you will make it simple. Anything that can convince us you are taking a thoughtful and expert approach to this design.
  • Give an estimated timeline on completing the task.
  • Post your application text to the Bounty Thread.

Selection Process

The ICDevs.org Developer Advisors will propose a vote to award the bounty, and the Developer Advisors will then vote.

Bounty Completion

Please keep your ongoing code in a public repository (a fork or branch is OK). Please provide regular (at least weekly) updates. Code commits count as updates if you link to your branch/fork from the bounty thread. We just need to be able to see that you are making progress.

The balance of the bounty will be paid out at completion.

Once you have finished, please alert the dev forum thread that you have completed work and where we can find that work. We will review and award the bounty reward if the terms have been met. If there is any coordination work (like a pull request) or additional documentation needed, we will inform you of what is needed before we can award the reward.

Bounty Abandonment and Re-awarding

If you cease work on the bounty for a prolonged period (at the Developer Advisory Board’s discretion), or if the quality of work degrades to the point that we think someone else should be working on the bounty, we may re-award it. We will be transparent about this and try to work with you to push through and complete the project, but sometimes it may be necessary to move on or to augment your contribution with another resource, which would result in a split bounty.

Funding

The bounty was generously funded by the DFINITY Foundation. If you would like to turbocharge this bounty you can seed additional donations of ICP to e156fe180f0f6deffa87344390dc45b2e6d4483d4007f6ea8f3f4d89e56fa5d2. ICDevs will match the bounty $40:1 ICP for the first 100 ICP out of the DFINITY grant and then 0.25:1. All donations will be tax deductible for US Citizens and Corporations. If you send a donation and need a donation receipt, please email the hash of your donation transaction, physical address, and name to donations@icdevs.org. More information about how you can contribute can be found at our donations page.

FYI: General Bounty Process

Discussion

The draft bounty is posted to the DFINITY developer’s forum for discussion.

Ratification: (01/09/2023)

The Developer Advisory Board will propose that a bounty be ratified, and a vote will take place to ratify it. Until a bounty is ratified by the board, it hasn’t been officially adopted. Please take this into consideration if you are considering starting early.

Open for application

Developers can submit applications to the Dev Forum post. The council will consider these as they come in and propose a vote to award the bounty to one of the applicants. If you would like to apply anonymously, you can send an email to austin at icdevs dot org or send a PM on the dev forum.

Assigned

A developer is currently working on this bounty. You are free to contribute, but any splitting of the award will need to be discussed with the currently assigned developer.

In Review

The Dev Council is reviewing the submission.

Awarded

The award has been given and the bounty is closed.

Other ICDevs.org Bounties

@skilesare
I was hoping one like this would come through.
I’ll happily do this one, if approved.
Honestly, there were a few of the new bounties that seemed very cool to work on, but I have something that needs XML and might as well start here.

Previous Bounties Completed: Candid and CBOR parsers

I’ll ask the board for approval, but as a past awardee, I’d say go get started. 🙂

Any updates on this one?

@skilesare does this mean I was approved?
Nothing runnable yet, but I have some basic functionality implemented. I was waiting for official approval before I made this a priority.

You are good to go. 🙂

Just an update: I have basic functionality down and a few tests in place that take XML text/bytes and tokenize them, then take those tokens and parse an XML document.

Remaining:

  • Refining the API
  • Check against XML spec
  • Fix misc bugs
  • Convert an XML doc to an XML string/bytes
  • Documentation

As of now, here are the types that I am using.
Types:


    public type Document = {
        version : ?Version;
        encoding : ?Text;
        standalone : ?Bool;
        root : Element;
        processInstructions : [ProcessingInstruction];
    };

    public type ProcessingInstruction = {
        target : Text;
        attributes : [Attribute];
    };

    public type ElementChild = {
        #element : Element;
        #text : Text;
        #comment : Text;
    };

    public type Element = {
        name : Text;
        attributes : [Attribute];
        children : {
            #selfClosing;
            #open : [ElementChild];
        };
    };

    public type Attribute = {
        name : Text;
        value : ?Text;
    };

    public type TagInfo = {
        name : Text;
        attributes : [Attribute];
    };

    public type Version = { major : Nat; minor : Nat };

    public type XmlDeclaration = {
        version : Version;
        encoding : ?Text;
        standalone : ?Bool;
    };

    public type Token = {
        #startTag : TagInfo and { selfClosing : Bool };
        #endTag : { name : Text };
        #text : Text;
        #comment : Text;
        #xmlDeclaration : XmlDeclaration;
        #processingInstruction : ProcessingInstruction;
    };

Update
Been adding more tests and refining the API.
It now returns error messages instead of just a null value.

TODO
Handle escaped characters like < and > (&lt;, &gt;)
Handle <![CDATA[ ]]>
Case sensitivity
Number parsing

@skilesare do you know of any good libraries for doing Text -> Nat parsing, or a Text library that would have a toLower function? If not, I might just write my own.

Update
I have handled escaping, but I have decided to create another repo, motoko_text, to help out with some of this text parsing, both for optimization and for use in other projects. I’ve also updated the motoko_numbers repo with some number parsing options, because I couldn’t find much out there that helped with what I needed.
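
For readers following along, here is a minimal sketch of what unescaping the five predefined XML entities can look like. It is not the code from the library or from motoko_text: the `unescapeText` name is hypothetical, it assumes `Text.replace` from the Motoko base library, and it ignores numeric character references such as `&#60;`.

```motoko
import Text "mo:base/Text";

// Hypothetical helper (not the library's actual code): replaces the five
// predefined XML entities with the characters they stand for.
func unescapeText(raw : Text) : Text {
    var out = Text.replace(raw, #text "&lt;", "<");
    out := Text.replace(out, #text "&gt;", ">");
    out := Text.replace(out, #text "&quot;", "\"");
    out := Text.replace(out, #text "&apos;", "'");
    // "&amp;" goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
    Text.replace(out, #text "&amp;", "&")
};

// Usage
assert unescapeText("a &lt; b &amp;&amp; c &gt; d") == "a < b && c > d";
```

The ordering is the one design point worth noting: replacing `&amp;` first would unescape already-escaped input twice.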

TODO
Integrating the motoko_text
Figuring out what to do about complex XML elements used mainly for validation and substitution. My current plan is to have a ‘non-validating’ XML parser that still parses those elements so validation can happen later, but I’ll get more concrete once I figure out parsing them.

UPDATE
XML is much more complicated than JSON.

Been working through the final parts of the ‘DocType’ parsing, which includes a lot of validation info, parameters and such. Most of this will just be parsed, since the plan so far is to have a ‘non-validating’ XML parser. Validation can always be added on top. At the bottom is the structure for the DocType, just to give an idea.

TODO
Encoder
Final parsing tests for doctype
Documentation
Review

Question @skilesare

  1. Since this is not technically a serialization library, I’m not sure how to handle a few things.
    A) Part of me wants to make this a ‘pure’ parsing library where, if there are any elements in the XML, they get parsed as is with no modifications, such as a CDATA tag vs text. CDATA contains raw text that can include special characters such as &, <, > without the need to escape them, whereas plain text needs to be unescaped. I was treating them as the same, but we lose certain information about the parsed XML if we do. BUT if we don’t, then the user has to handle it, which is not ideal.
    B) A lot of information is returned that is never going to be used. I see this library being used as a deserialization helper, which means that no one will care about anything but the root element property that is returned out of:
    public type Document = {
        version : ?Version;
        encoding : ?Text;
        standalone : ?Bool;
        root : Element;
        docType : ?DocType;
        processInstructions : [ProcessingInstruction];
    };

I feel like maybe I need to add another thing on top of the XML to just return the root element and have all the text unescaped and all the parameters replaced. So have both and people can use what they need, like decodeXML which returns all the XML info and decode which would handle all the complexities and just return the root element that has been modified to be easy to use?

C) Adding onto point B, I would guess that when people use this, either the XML is parsed or it should trap. I’m still getting used to returning #ok/#error, but sometimes it’s just annoying to deal with, using switch statements every time one wants to call something that could have an issue. Rust handles this better, but is it bad practice to have both decode and decodeOrTrap?
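
To make point C concrete, here is a rough sketch of how a trapping wrapper could sit on top of a result-returning decoder. Everything in it is a stand-in: the one-field `Element`, the toy `decode` body, and the use of the base library's `Result.Result` with `#ok`/`#err` are assumptions for illustration, not the library's actual types.

```motoko
import Debug "mo:base/Debug";
import Result "mo:base/Result";

// Hypothetical stand-ins, not the library's real types or parser.
type Element = { name : Text };

// Toy decoder: fails on empty input, otherwise returns a fixed element.
func decode(xml : Text) : Result.Result<Element, Text> {
    if (xml == "") { #err("empty input") } else { #ok({ name = "root" }) }
};

// Unwraps the result, trapping with the error message on failure.
func decodeOrTrap(xml : Text) : Element {
    switch (decode(xml)) {
        case (#ok(element)) element;
        case (#err(message)) Debug.trap("XML decode failed: " # message);
    }
};

// Usage: callers that cannot recover anyway skip the switch.
assert decodeOrTrap("<root/>").name == "root";
```

Keeping both lets callers choose: `decode` when they can handle the error, `decodeOrTrap` when a failure should abort the call.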

DocType:



    public type ChildElementKind = {
        #sequence : [ChildElement];
        #choice : [ChildElement];
        #element : Text;
    };

    public type Ocurrance = {
        #one;
        #zeroOrOne;
        #zeroOrMore;
        #oneOrMore;
    };

    public type ChildElement = {
        kind : ChildElementKind;
        ocurrance : Ocurrance;
    };

    public type AllowableContents = {
        #any;
        #empty;
        #children : ChildElement;
        #mixed : ChildElement;
    };

    public type ElementTypeDefintion = {
        name : Text;
        allowableContents : AllowableContents;
    };

    public type AttributeType = {
        #cdata;
        #id;
        #idRef;
        #idRefs;
        #entity;
        #entities;
        #nmToken;
        #nmTokens;
        #notation;
        #enumeration : [Text];
    };

    public type AttributeTypeDefinition = {
        elementName : Text;
        name : Text;
        type_ : AttributeType;
        defaultValue : { #required; #implied; #fixed : Text };
    };

    public type GeneralEntityTypeDefinition = {
        name : Text;
        type_ : {
            #internal : {
                value : Text;
            };
            #external : {
                type_ : { #system_; #public_ : { id : Text } };
                url : Text;
                notationId : ?Text;
            };
        };
    };

    public type ParameterEntityTypeDefinition = {
        name : Text;
        type_ : {
            #internal : {
                value : Text;
            };
            #external : {
                type_ : { #system_; #public_ : { id : Text } };
                url : Text;
            };
        };
    };

    public type NotationTypeDefinition = {
        name : Text;
        type_ : {
            #system_ : { url : Text };
            #public_ : { id : Text; url : ?Text };
        };
    };

    public type InternalDocumentTypeDefinition = {
        #element : ElementTypeDefintion;
        #attribute : AttributeTypeDefinition;
        #generalEntity : GeneralEntityTypeDefinition;
        #parameterEntity : ParameterEntityTypeDefinition;
        #notation : NotationTypeDefinition;
        #processingInstruction : ProcessingInstruction;
        #comment : Text;
    };
    public type ExternalTypesDefinition = {
        #system_ : { url : Text };
        #public_ : { id : Text; url : Text };
    };

    public type DocumentTypeDefinition = {
        externalTypes : ?ExternalTypesDefinition;
        internalTypes : [InternalDocumentTypeDefinition];
    };

    public type DocType = {
        rootElementName : Text;
        typeDefinition : DocumentTypeDefinition;
    };

UPDATE
I have completed the XML document parsing with a bunch of tests. Now I’m moving on to the encoder and the thing mentioned above, where I’ll make another function that takes the complicated XML doc and spits out the element hierarchy with everything unescaped and parameters replaced.

I thought that switching from binary to string parsing would be a better experience, but handling all these cases and whitespace is a bit much lol

Also, using MOPS for testing is great. I don’t even use Vessel anymore; MOPS handles it all.

UPDATE
I now have the ‘simple’/processed root element decoding. It takes the document, replaces any entity values and handles escaped values. Here is what it looks like:

    public func decode(bytes : Blob) : Result<Element.Element>;

    public type Element = {
        name : Text;
        attributes : [Attribute];
        children : [ElementChild];
    };

    public type ElementChild = {
        #element : Element;
        #text : Text;
    };

    public type Attribute = {
        name : Text;
        value : ?Text;
    };
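
As an illustration of why the simplified shape is easy to consume, here is a small sketch that walks an `Element` and concatenates its text. The types are copied from the post above so the snippet stands alone; the `textContent` helper and the sample document are hypothetical and not part of the library.

```motoko
// Types copied from the post above so the sketch is self-contained.
type Attribute = { name : Text; value : ?Text };
type ElementChild = { #element : Element; #text : Text };
type Element = { name : Text; attributes : [Attribute]; children : [ElementChild] };

// Hypothetical helper: concatenates all text nodes, depth first.
func textContent(element : Element) : Text {
    var out = "";
    for (child in element.children.vals()) {
        switch (child) {
            case (#text(t)) { out := out # t };
            case (#element(e)) { out := out # textContent(e) };
        };
    };
    out
};

// Hand-built value standing in for <greeting lang="en">Hello, <b>world</b></greeting>.
let doc : Element = {
    name = "greeting";
    attributes = [{ name = "lang"; value = ?"en" }];
    children = [
        #text("Hello, "),
        #element({ name = "b"; attributes = []; children = [#text("world")] })
    ];
};

assert textContent(doc) == "Hello, world";
```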

TODO
Encoder
Docs

@skilesare
I am officially done with everything and would only need changes based on review.

I have tests hitting all the cases I can think of.

The README is populated and the Xml functions are decorated with doc comments.

Here is what the ‘exposed’ API looks like:

    public func deserializeFromBytes(xmlBytes : Iter.Iter<Nat8>) : Result<Element.Element>;

    public func deserialize(xml : Iter.Iter<Char>) : Result<Element.Element>;

    public func serializeToBytes(root : Element.Element) : Iter.Iter<Nat8>;

    public func serialize(root : Element.Element) : Iter.Iter<Char>;

With types:

    public type Element = {
        name : Text;
        attributes : [Attribute];
        children : ElementChildren;
    };

    public type ElementChildren = {
        #selfClosing;
        #open : [ElementChild];
    };

    public type ElementChild = {
        #element : Element;
        #text : Text;
    };

    public type Attribute = {
        name : Text;
        value : ?Text;
    };
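
For anyone curious what the encoding direction involves, here is a minimal sketch that turns these types into text. It is not the library's `serialize` (which returns `Iter.Iter<Char>`): the `escape` and `toXmlText` names are hypothetical, the result is a plain `Text`, and printing a bare name for an attribute with a `null` value is an assumption made only to cover the `?Text` case.

```motoko
import Text "mo:base/Text";

// Types copied from the post above so the sketch is self-contained.
type Attribute = { name : Text; value : ?Text };
type ElementChild = { #element : Element; #text : Text };
type ElementChildren = { #selfClosing; #open : [ElementChild] };
type Element = { name : Text; attributes : [Attribute]; children : ElementChildren };

// Hypothetical helper: escapes characters that are special in XML.
// "&" is replaced first so the entities added afterwards stay intact.
func escape(raw : Text) : Text {
    var out = Text.replace(raw, #text "&", "&amp;");
    out := Text.replace(out, #text "<", "&lt;");
    out := Text.replace(out, #text ">", "&gt;");
    Text.replace(out, #text "\"", "&quot;")
};

// Hypothetical encoder: writes the element tree as XML text.
func toXmlText(element : Element) : Text {
    var out = "<" # element.name;
    for (attribute in element.attributes.vals()) {
        switch (attribute.value) {
            case (?value) { out := out # " " # attribute.name # "=\"" # escape(value) # "\"" };
            // Assumption: an attribute without a value is written as a bare name.
            case null { out := out # " " # attribute.name };
        };
    };
    switch (element.children) {
        case (#selfClosing) { out := out # "/>" };
        case (#open(children)) {
            out := out # ">";
            for (child in children.vals()) {
                switch (child) {
                    case (#text(t)) { out := out # escape(t) };
                    case (#element(e)) { out := out # toXmlText(e) };
                };
            };
            out := out # "</" # element.name # ">";
        };
    };
    out
};

// Usage
assert toXmlText({ name = "br"; attributes = []; children = #selfClosing }) == "<br/>";
```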

Under the hood there is still all the complexity, so I can parse a full XML document with all its complexities and entities and use them in the final simple form. I can expose some of that complexity later if needed, but I don’t see a lot of use cases for it right now. I think it’s more confusing than anything, but maybe I’m wrong about that.

Github: GitHub - edjCase/motoko_xml: XML encoding/decoding library in motoko
MOPS: Internet Computer Content Validation Bootstrap

I also added the small text library I referenced earlier, which I’ll keep adding to.

@skilesare
I went ahead and pushed the initial version to MOPS.
Any other thoughts on the library?

@skilesare
Any update on this?
Looks like it’s in review.

Just waiting for the dfinity payment.
