Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing string values and treating as dates is incorrect #1300

Closed
pzhelnov opened this issue Oct 5, 2018 · 13 comments
Closed

Parsing string values and treating as dates is incorrect #1300

pzhelnov opened this issue Oct 5, 2018 · 13 comments

Comments

@pzhelnov
Copy link

pzhelnov commented Oct 5, 2018

Hi!
There is a CSV file, which I try to read and it contains field with value

"Aprobil P 0.1%"

the short example of CSV:


"Country";"Product Family"
"Germany";"Aprobil P 0.1%"

conversion to workbook is the following:

var workbook = XLSX.read(csvData, {
                    type:'string',
                    dateNF: 'D-M-YYYY',
                    cellDates:true,
                    cellText:true,
                    cellNF: false,
                raw:false});

after conversion I save the XLS, where value "Aprobil P 0.1%" is converted to a date 01.04.00

looking into the worksheet model and getting the certain cell, it contains:

{
      t: 'd',
      v: 'Sat Apr 01 2000 00:00:00 GMT+0300 (Eastern European Summer Time)',
      z:undefined
}

is there any way to cover this case?
Could you pls assist, what to do.

Thanks in advance!

@pzhelnov
Copy link
Author

any update, any comments? is anybody here? ;)

@metal4timmy
Copy link

I am having a similar issue with certain strings being converted to date values. Any update on this?

@pzhelnov
Copy link
Author

guys, @SheetJSDev, pls, just pay a little attention to it, tell us, is it possible to fix, or should we rely on our workaround?

@pzhelnov
Copy link
Author

pzhelnov commented Sep 11, 2019

guys, one year has been passed, no one even commented.
Do I need to switch to another project?

@SheetJSDev
Copy link
Contributor

Hello @pzhelnov ! First, understand that this is an open source project and all work done on the project is charity. If you are concerned about response time or interested in supporting development, we offer paid builds and support as discussed in https://sheetjs.com/pro

That said, unfortunately there's no magic solution here. The Date parsing currently falls back on the JS Date builtin features, which are super flexible in chrome. As an example:

new Date("This is not a date 1")

is the date January 1 2001, despite it clearly indicating that it isn't a date.

Setting the option raw: true will give you the raw strings, suppressing any sort of value interpretation.

If you are interested in helping, the relevant function is fuzzydate. Google Docs / Office Online likely use a series of regular expressions based on some known formats like /\d{4}-\d{2}-\d{2}/ for yyyy-mm-dd and that might be a more sensible approach than using the generic Date constructor

@pzhelnov
Copy link
Author

Thank you very much for an answer!
Sorry for raising my hand in such a rude manner, we've just stuck on this problem for a long time and my workaround approach was not so strict.

First of all I have to say this is great project, one of the best, flexible and configurable, which currently existing in the community.
So, keep going!

@SheetJSDev
Copy link
Contributor

We'd accept a PR that does the following:

  1. add an option to the reader, call it something like strictDates (use your imagination) that would use this alternative parser

  2. change the line https://github.com/SheetJS/js-xlsx/blob/master/bits/40_harb.js#L834 to

			else if(!isNaN(fuzzydate(s, o.strictDates).getDate()) || _re && s.match(_re)) {
  1. add a second parameter to fuzzydate and load up the regexes in https://github.com/SheetJS/js-xlsx/blob/master/bits/20_jsutils.js#L126:
var STRICT_DATE_REGEXES = [
	/\d{4}-\d{2}-\d{2}/
]
function fuzzydate(s/*:string*/, strict/*:boolean*/)/*:Date*/ {
	var o = new Date(s), n = new Date(NaN);
	if(strict) {
		for(var i = 0; i < STRICT_DATE_REGEXES.length; ++i) if(s.match(STRICT_DATE_REGEXES[i])) return o;
		return n;
	}
	var y = o.getYear(), m = o.getMonth(), d = o.getDate();
	if(isNaN(d)) return n;
	if(y < 0 || y > 8099) return n;
	if((m > 0 || d > 1) && y != 101) return o;
	if(s.toLowerCase().match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) return o;
	if(s.match(/[^-0-9:,\/\\]/)) return n;
	return o;
}

This obviously isn't 100% correct but is roughly shaped like how we'd expect the ultimate code to be. The new code doesn't the existing behavior (so we could get away with a patch version bump) but it does give you a backdoor to solve your immediate problem and gives us room to grow as we grind further

@nandanv2702
Copy link

Hey, is this issue still open to a PR? I'd like to give it a shot if I can

@SheetJSDev
Copy link
Contributor

As is usually the case with these problems, the code is the easy part. The hard part is making a decision :(

If this is of interest, start by enumerating the common date formats that the parser should recognize (for example, YYYY-MM-DD) and constructing the corresponding regular expressions. Then we'd accept a PR that follows the approach outlined in the previous comment.

@nandanv2702
Copy link

Hmm, I see - let me start working on this in a bit and I'll see if I can add something of value that works. It's also my first open-source contrib, so please lmk if there's something more I can do!

@nandanv2702
Copy link

Hey! Just created a PR for this here - let me know if I need to make any more changes, etc.

@SheetJSDev
Copy link
Contributor

Based on some testing, in the en-US locale, there appear to be no valid Excel dates that contain letters outside of the month short or long names (even full day names are not accepted). Looking at a number of the issues, a guard like

    var lower = s.toLowerCase();
    if(lower.match(/jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec/)) {
      lower = lower.replace(/[^a-z]/g,"");
      if(lower.length > 3 && lower_months.indexOf(lower) == -1) return n;
      return o;
    }

(where lower_months is an array of the long names of months) would address many of the problems. The proposed "strict" mode is still needed for determining field values and matching formats that Chrome does not accept (for example, 1:23.04:56)

@SheetJSDev
Copy link
Contributor

Fixed in 0.18.3: https://jsfiddle.net/9u84Lwja/

const csvData = `\
"Country";"Product Family"
"Germany";"Aprobil P 0.1%"`;
var workbook = XLSX.read(csvData, {
                    type:'string',
                    dateNF: 'D-M-YYYY',
                    cellDates:true,
                    cellText:true,
                    cellNF: false,
                raw:false});

out.innerText = JSON.stringify(workbook.Sheets.Sheet1, 2, 2);

The relevant cell is

  "B2": {
    "t": "s",
    "v": "Aprobil P 0.1%"
  },

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants